I just started using Jupyter with R, and I'm wondering if there's a good way to display HTML or LaTeX output.
Here's some example code that I wish worked:
library(xtable)
x <- runif(500, 1, 50)
y <- x + runif(500, -5, 5)
model <- lm(y~x)
print(xtable(model), type = 'html')
Instead of rendering the HTML, it just displays it as plaintext. Is there any way to change that behavior?
A combination of repr (for setting options) and IRdisplay will work for HTML. Others may know more about LaTeX; a minimal display_latex() sketch follows the cells below.
# Cell 1 ------------------------------------------------------------------
library(xtable)
library(IRdisplay)
library(repr)
data(tli)
tli.table <- xtable(tli[1:20, ])
digits(tli.table) <- matrix(0:4, nrow = 20, ncol = ncol(tli) + 1)
options(repr.vector.quote = FALSE)
display_html(paste(capture.output(print(head(tli.table), type = 'html')), collapse = ""))
# Cell 2 ------------------------------------------------------------------
display_html("<span style='color:red; float:right'>hello</span>")
# Cell 3 ------------------------------------------------------------------
display_markdown("[this](http://google.com)")
# Cell 4 ------------------------------------------------------------------
display_png(file="shovel-512.png")
# Cell 5 ------------------------------------------------------------------
display_html("<table style='width:20%;border:1px solid blue'><tr><td style='text-align:right'>cell 1</td></tr></table>")
I found a simpler answer, for the initial, simple use case.
If you call xtable without wrapping it in a call to print, then it totally works. E.g.,
library(xtable)
data(cars)
model <- lm(speed ~ ., data = cars)
xtable(model)
In Jupyter, you can use Markdown. Just be sure to change the Jupyter cell from a code cell to a Markdown cell. Once you have done this, simply place a double dollar sign ("$$") before and after the LaTeX you have, then run the cell.
The steps are as follows:
1. Create a Markdown cell.
2. $$ some LaTeX $$
3. Press the play button within Jupyter (a concrete example follows).
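For example, a Markdown cell containing the following line renders a fitted-line equation, in the spirit of the lm(y ~ x) model from the question (the symbols are illustrative):
$$ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x $$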
Defining the following function in the session makes any object returned by xtable() display as the HTML that xtable generates:
repr_html.xtable <- function(obj, ...) {
  paste(capture.output(print(obj, type = 'html')), collapse = "")
}
library(xtable)
data(cars)
model <- lm(speed ~ ., data = cars)
xtable(model)
Without the repr_html.xtable function, the returned object (which is also of class data.frame) would be rich-displayed by the kernel as a plain HTML table via repr::repr_html.data.frame.
Just don't print(...) the object :-)
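To see why this works, inspect the class vector; repr dispatches on the first class that has a matching repr_html method:
class(xtable(model))
# [1] "xtable"     "data.frame"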
Render/embed an HTML/LaTeX table in the IR kernel (Jupyter)
Some R packages, such as knitr, produce tables in HTML format, so if you want to put these tables in the notebook:
library(knitr)
library(kableExtra)
library(IRdisplay)  # the package that you need
# Create the table
dt <- mtcars[1:5, 1:6]
options(knitr.table.format = "html")
html_table <- kable(dt) %>%
  kable_styling("striped") %>%
  add_header_above(c(" " = 1, "Group 1" = 2, "Group 2" = 2, "Group 3" = 2))
# Put the table in the notebook
display_html(toString(html_table))
Or, for example, if you have a LaTeX file:
display_latex(file = "your file path")
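A minimal usage sketch, assuming a throwaway file (the name eq.tex is illustrative):
writeLines("$$\\bar{x} = \\frac{1}{n}\\sum_{i=1}^{n} x_i$$", "eq.tex")
display_latex(file = "eq.tex")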
I want to extract a bunch of numbers from a website, https://www.bcassessment.ca/, using PIDs (unique IDs). A list of sample PIDs is shown below:
PID <- c("012-215-023", "024-521-647", "025-891-669")
For these values, I opened the website manually, chose PID from the list of available options in the site's search engine, and searched for these numbers. The search redirected me to the following URLs:
URL <- c("https://www.bcassessment.ca//Property/Info/QTAwMDAwM1hIUA==",
"https://www.bcassessment.ca//Property/Info/QTAwMDAwNEJKMA==",
"https://www.bcassessment.ca//Property/Info/QTAwMDAwMUc5OA==")
Then for each of these URLs, I ran the code shown below, to extract the total value of the property:
library(rvest)

out <- c()
for (i in 1:length(URL)) {
  url <- URL[i]
  out[i] <- url %>%
    read_html() %>%
    html_nodes('span#lblTotalAssessedValue') %>%
    html_text()
}
which gives me the final result
[1] "$543,000" "$957,000" "$487,000"
The problem is that I have a list of more than 50,000 PIDs, and I cannot manually search each of them on the website to find the actual link and then run rvest to scrape it. How do you recommend automating this process so that I only provide the PIDs and get the prices as output?
Summary: for a list of known PIDs, I want to open https://www.bcassessment.ca/ and extract the most up-to-date price of each property, automatically.
Test_PID
I've added a list of PIDs below, so you can check that the code works:
structure(list(P.I.D.. = c("004-050-541", "016-658-540", "016-657-861",
"016-657-764", "019-048-386", "025-528-360", "800-058-036", "025-728-954",
"028-445-783", "027-178-048", "028-445-571", "025-205-145", "015-752-798",
"026-041-308", "024-521-698", "027-541-631", "024-360-651", "028-445-040",
"025-851-411", "025-529-293", "024-138-436", "023-893-796", "018-496-768",
"025-758-721", "024-219-665", "024-359-866", "018-511-015", "026-724-979",
"023-894-253", "006-331-505", "025-961-012", "024-219-690", "027-309-878",
"028-445-716", "025-759-060", "017-692-733", "025-728-237", "028-447-221",
"023-894-202", "028-446-020", "026-827-611", "028-058-798", "017-574-412",
"023-893-591", "018-511-457", "025-960-199", "027-178-714", "027-674-941",
"027-874-826", "025-110-390", "028-071-336", "018-257-984", "023-923-393",
"026-367-203", "027-601-854", "003-773-922", "025-902-989", "018-060-641",
"025-530-003", "018-060-722", "025-960-423", "016-160-126", "009-301-461",
"025-960-580", "019-090-315", "023-464-283", "028-445-503", "006-395-708",
"028-446-674", "018-258-549", "023-247-398", "029-321-166", "024-519-871",
"023-154-161", "003-904-547", "004-640-357", "006-314-864", "025-960-521",
"013-326-783", "003-430-049", "027-490-084", "024-360-392", "028-054-474",
"026-076-179", "005-309-689", "024-613-509", "025-978-551", "012-215-066",
"024-034-002", "025-847-244", "024-222-038", "003-912-019", "024-845-264",
"006-186-254", "026-826-691", "026-826-712", "024-575-569", "028-572-581",
"026-197-774", "009-695-958", "016-089-120", "025-703-811", "024-576-671",
"026-460-751", "026-460-149", "003-794-181", "018-378-684", "023-916-745",
"003-497-721", "003-397-599", "024-982-211", "018-060-129", "018-061-231",
"017-765-714", "027-303-799", "028-565-312", "018-061-010", "006-338-232",
"023-680-024", "028-983-971", "028-092-490", "006-293-239", "018-061-257",
"028-092-376", "018-060-137", "004-302-664", "016-988-060", "003-371-166",
"027-325-342", "011-475-480", "018-060-200")), row.names = c(NA,
-131L), class = c("tbl_df", "tbl", "data.frame"))
P.S. The website I mentioned is public; anyone can open it and enter an address to find the estimated price of a property, so I don't think there is any problem with scraping it.
When you submit the pid through the form, it triggers the following call:
GET https://www.bcassessment.ca/Property/Search/GetByPid/012215023?PID=012215023&_=1619713418473
The call above has the following parameters:
012215023 is the PID from your input with the dashes removed. It appears both as a path segment and as a query parameter.
1619713418473 is the current timestamp in milliseconds since 1970 (a Unix timestamp), computed in R below.
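Such a millisecond timestamp can be produced in R like this (the same expression is used in the full function further down):
time <- as.numeric(as.POSIXct(Sys.time())) * 1000
format(time, digits = 13)  # e.g. "1619713418473"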
The result of the call above is a json response like this:
{
"sEcho": 1,
"aaData": [
["XXXXXXX", "XXXXXXXX", "XXXXXXXXXXXX", "200-027-615-115-48-0004", "QTAwMDAwM1hIUA=="]
]
}
The call above returns the response as text/plain rather than application/json, so we have to parse it with jsonlite, pick the last item of the aaData array value (in this case QTAwMDAwM1hIUA==), and build the resulting URL like the one in your post.
The following code takes a list of PIDs and extracts the assessed $ value for each of them:
library(rvest)

getValueForPID <- function(pid) {
  # Strip the dashes and build the millisecond timestamp
  pidNum <- gsub("-", "", pid)
  time <- as.numeric(as.POSIXct(Sys.time())) * 1000
  output <- httr::content(httr::GET(
    paste0("https://www.bcassessment.ca/Property/Search/GetByPid/", pidNum),
    query = list(
      "PID" = pidNum,
      "_" = format(time, digits = 13)
    )
  ), "text", encoding = "UTF-8")
  if (output == "found_no_results") {
    return("")
  }
  data <- jsonlite::fromJSON(output)
  id <- data$aaData[5]  # the last item: the base64 property id
  text <- paste0("https://www.bcassessment.ca/Property/Info/", id) %>%
    read_html() %>%
    html_nodes('span#lblTotalAssessedValue') %>%
    html_text()
  return(text)
}
PID <- c("004-050-541", "016-658-540", "016-657-861", "016-657-764", "019-048-386", "025-528-360", "800-058-036")
out <- c()
count <- 1
for (i in PID) {
  print(i)
  out[count] <- getValueForPID(i)
  count <- count + 1
}
print(out)
Sample output:
[1] "$543,000" "$957,000" "$487,000"
kaggle link: https://www.kaggle.com/bertrandmartel/bcassesment-pid
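Given that you have more than 50,000 PIDs, you'll want to throttle the calls and survive individual failures. A rough sketch (the 0.5 s delay is an arbitrary guess, not a documented rate limit):
values <- vapply(PID, function(pid) {
  Sys.sleep(0.5)  # be polite to the server; adjust as needed
  tryCatch(getValueForPID(pid), error = function(e) NA_character_)
}, character(1))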
I am trying to rbind some JSON data scraped from an API.
library(jsonlite)
pop_dat <- data.frame()
for (i in 1:3) {
  # Generate the url for each page
  url <- paste0('http://api.worldbank.org/v2/countries/all/indicators/SP.POP.TOTL?format=json&page=', i)
  # Get the json data from each page and transform it into a dataframe
  dat <- as.data.frame(fromJSON(url)[2], flatten = TRUE, row.names = NULL)
  pop_dat <- rbind(pop_dat, dat)
}
However, it returns the following error:
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘1’, ‘10’, ‘11’, ‘12’, ‘13’, ‘14’, ‘15’, ‘16’, ‘17’, ‘18’, ‘19’, ‘2’, ‘20’, ‘21’, ‘22’, ‘23’, ‘24’, ‘25’, ‘26’, ‘27’, ‘28’, ‘29’, ‘3’, ‘30’, ‘31’, ‘32’, ‘33’, ‘34’, ‘35’, ‘36’, ‘37’, ‘38’, ‘39’, ‘4’, ‘40’, ‘41’, ‘42’, ‘43’, ‘44’, ‘45’, ‘46’, ‘47’, ‘48’, ‘49’, ‘5’, ‘50’, ‘6’, ‘7’, ‘8’, ‘9’
Setting the row.names to NULL doesn't work. I heard from someone that this is because some of the data are stored as lists, which I don't quite understand.
I understand that there is an alternative package, WDI, for accessing this data, and it works well, but I want to know how to resolve the duplicate row names problem here in general, so that I can deal with similar situations where no alternative package is available.
I heard from someone it is due to the fact that some data are stored as lists...
This is correct. The solution is fairly simple, but I find it really easy to get tripped up by this. Right now you're using:
dat <- as.data.frame(fromJSON(url)[2],flatten = TRUE, row.names = NULL)
The problem comes from fromJSON(url)[2]. This should be fromJSON(url)[[2]] instead. According to the documentation, the key difference between [ and [[ is a single bracket can select multiple elements whereas [[ selects only one.
You can see how this works with some fake data.
foo <- list(
  a = rnorm(100),
  b = rnorm(100),
  c = rnorm(100)
)
With [, you can select multiple values inside this list.
foo[c("a", "b")]
length(foo["a"]) # Result is 1 not 100 like you might expect.
With [[ the results are different.
foo[[c("a", "b")]] # Raises a subscript error.
foo[["a"]] #This works.
length(foo[["a"]]) # Result is 100.
So, your answer will depend on which subset operator you're using. For your problem, you'll want to use [[ to select a single data.frame inside of the list. Then, you should be able to use rbind correctly.
final <- data.frame()
for (i in 1:10) {
  url <- paste0(
    'http://api.worldbank.org/v2/countries/all/indicators/SP.POP.TOTL?format=json&page=',
    i
  )
  res <- jsonlite::fromJSON(url, flatten = TRUE)[[2]]
  final <- rbind(final, res)
}
Alternative solution with lapply:
urls <- sprintf(
  'http://api.worldbank.org/v2/countries/all/indicators/SP.POP.TOTL?format=json&page=%s',
  1:10
)
resl <- lapply(urls, jsonlite::fromJSON, flatten = TRUE)
resl <- lapply(resl, "[[", 2)  # select the 2nd element from each list element
resl <- do.call(rbind, resl)   # pass all the list elements as arguments to rbind at once
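As an aside: if different pages ever return slightly different column sets, do.call(rbind, ...) will fail on the mismatch. If you don't mind an extra dependency, dplyr::bind_rows fills missing columns with NA instead:
resl <- lapply(urls, jsonlite::fromJSON, flatten = TRUE)
final <- dplyr::bind_rows(lapply(resl, "[[", 2))  # stacks a list of data frames, padding missing columns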
I would like R to take a word from a column in a dataset and return a value from a website. The code I have so far is below. For each word in the data frame column, it should go to the website and return the pronunciation (for example, the pronunciation on http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=word&stress=-s is "W ER1 D"). I have looked at the HTML of the website, and it's unclear what I would need to select to return this value: it's between <tt> and </tt>, but there are many of these. I'm also not sure how to then get that value into R. Thank you.
library(xml2)

for (word in df$word) {
  result <- read_html(paste0("http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=", word, "&stress=-s"))
}
Parsing HTML is a tricky task in R, but there are a couple of ways. If the HTML converts well to XML and the website/API always returns the same structure, you can use XML-parsing tools. Otherwise, you could use a regex and call stringr::str_extract() on the raw HTML.
In your case, it is fairly easy to get the value you're looking for using XML tools. It's true that there are a lot of <tt> tags, but the one you want is always the second instance, so you can just pull that one out.
# load packages; dplyr is just for the pipe (%>%)
library(httr)
library(XML)
library(dplyr)

# test words
wordlist <- c('happy', 'sad')

for (word in wordlist) {
  # build the url and GET the result
  url <- paste0("http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=", word, "&stress=-s")
  h <- handle(url)
  res <- GET(handle = h)
  # parse the HTML
  resXML <- htmlParse(content(res, as = "text"))
  # retrieve the second <tt>
  print(getNodeSet(resXML, '//tt[2]') %>% sapply(., xmlValue))
  # don't abuse your API
  Sys.sleep(0.1)
}
>[1] "HH AE1 P IY0 ."
>[1] "S AE1 D ."
Good luck!
EDIT: This code will return a dataframe:
# load packages; dplyr is just for the pipe (%>%)
library(httr)
library(XML)
library(dplyr)

# test words
wordlist <- c('happy', 'sad')

# initialize the dataframe with a pronunciation field
pronunciation_list <- data.frame(pronunciation = character(), stringsAsFactors = FALSE)

# loop over the words
for (word in wordlist) {
  # build the url and GET the result
  url <- paste0("http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=", word, "&stress=-s")
  h <- handle(url)
  res <- GET(handle = h)
  # parse the HTML
  resXML <- htmlParse(content(res, as = "text"))
  # retrieve the second <tt>
  to_add <- data.frame(pronunciation = (getNodeSet(resXML, '//tt[2]') %>% sapply(., xmlValue)))
  # bind the data
  pronunciation_list <- rbind(pronunciation_list, to_add)
  # don't abuse your API
  Sys.sleep(0.1)
}
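For comparison, here is a minimal sketch of the same extraction using rvest instead of XML/httr (assuming rvest is installed; this is an alternative, not part of the approach above):
library(rvest)

word <- "happy"  # illustrative input
url <- paste0("http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=", word, "&stress=-s")
pron <- read_html(url) %>%
  html_nodes("tt") %>%
  .[[2]] %>%        # the second <tt>, as above
  html_text()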
The following code works for individual .html files:
library(XML)

doc <- htmlParse("New folder/1-4.html")
plain.text <- xpathSApply(doc, "//td", xmlValue)
plain.text <- gsub("\n", "", plain.text)
gregexpr("firstThing", plain.text)
firstThing <- substring(plain.text[9], 41, 50)
gregexpr("secondThing", plain.text)
secondThing <- substring(plain.text[7], 1, 550)
But the following loop does not, and gives me the error:
XML content does not seem to be XML
file.names <- dir(path = "New folder")
for (i in 1:length(file.names)) {
  doc <- htmlParse(file.names[i])
  plain.text <- xpathSApply(doc, "//td", xmlValue)
  gsub("\n", "", plain.text)
  firstThing[i] <- substring(plain.text[9], 41, 50)
  secondThing[i] <- substring(plain.text[7], 1, 550)
}
I'm simply trying to extract the information (as I've been able to do in the first batch of code), and create a vector of information.
Any ideas on how to resolve this issue?
Two things. First, your paths are wrong: dir() returns bare file names, not paths. To fix this, use:
filenames = dir(path = "New folder", full.names = TRUE)
Secondly, rather than filling two variables inside a for loop, a better way is to build structured data with lapply:
result = lapply(filenames, function (filename) {
  doc = htmlParse(filename)
  plain_text = xpathSApply(doc, "//td", xmlValue)
  c(first = substring(plain_text[9], 41, 50),
    second = substring(plain_text[7], 1, 550))
})
Now result is a list of elements, where each element is a vector with names first and second.
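If you then want a single data frame rather than a list, a small sketch:
df <- as.data.frame(do.call(rbind, result), stringsAsFactors = FALSE)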
A few other remarks:
Be wary of dots in variable names: S3 uses dots in function names for method dispatch (e.g. print.data.frame is the print method for data.frame objects). Using dots for anything else in variable names causes confusion and should be avoided.
The gsub statement in your loop has no effect, because its result is never assigned.
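If the newline stripping is actually needed, assign the result back, e.g. inside the lapply body above:
plain_text = gsub("\n", "", plain_text)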