What's the preferred way in R to convert a character (vector) containing non-ASCII characters to HTML? I would, for example, like to convert
"ü"
to
"ü"
I am aware that this is possible by a clever use of gsub (but has anyone done it once and for all?) and I thought that the package R2HTML would do that, but it doesn't.
EDIT: Here is what I ended up using; it can obviously be extended by modifying the dictionary:
char2html <- function(x){
  dictionary <- data.frame(
    symbol = c("ä", "ö", "ü", "Ä", "Ö", "Ü", "ß"),
    html   = c("&auml;", "&ouml;", "&uuml;", "&Auml;",
               "&Ouml;", "&Uuml;", "&szlig;"))
  for(i in 1:dim(dictionary)[1]){
    x <- gsub(dictionary$symbol[i], dictionary$html[i], x)
  }
  x
}
x <- c("Buschwindröschen", "Weißdorn")
char2html(x)
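For reference, with the dictionary above the call should produce something like:

char2html(x)
# [1] "Buschwindr&ouml;schen" "Wei&szlig;dorn"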
This question is pretty old, but I couldn't find any straightforward answer... So I came up with this simple function, which uses numerical HTML codes and works for the Latin-1 Supplement block (integer values 161 to 255). There's probably (certainly?) a function in some package that does it more thoroughly, but what follows is probably good enough for many applications...
conv_latinsupp <- function(...) {
  out <- character()
  for (s in list(...)) {
    splitted <- unlist(strsplit(s, ""))
    intvalues <- utf8ToInt(enc2utf8(s))
    pos_to_modify <- which(intvalues >= 161 & intvalues <= 255)
    splitted[pos_to_modify] <- paste0("&#", intvalues[pos_to_modify], ";")
    out <- c(out, paste0(splitted, collapse = ""))
  }
  out
}
conv_latinsupp("aeiou", "àéïôù12345")
## [1] "aeiou" "àéïôù12345"
The XML package has a function insertEntities for this, but that function is internal. So you may use it at your own risk, as there is no guarantee that it will continue to work like this in future versions.
Right now, your code could be accomplished using
char2html <- function(x) XML:::insertEntities(x, c("ä"="auml", "ö"="ouml", …))
The use of a named vector instead of a data.frame feels kind of elegant, but doesn't change the core of things. Under the hood, insertEntities calls gsub in much the same way your code does.
If numeric HTML entities are valid in your environment, then you could probably convert all your text into those using utf8ToInt and then turn safely printable ASCII characters back into unescaped form. This would save you the trouble of maintaining a dictionary for your entities.
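A rough sketch of that idea, assuming numeric entities are acceptable everywhere (the function name is made up; note that HTML-significant ASCII characters such as &, < and > are left untouched here and would still need their own escaping):

to_numeric_entities <- function(x) {
  vapply(x, function(s) {
    ints <- utf8ToInt(enc2utf8(s))
    out <- paste0("&#", ints, ";")
    ascii <- ints >= 32 & ints <= 126          # keep plain printable ASCII unescaped
    out[ascii] <- intToUtf8(ints[ascii], multiple = TRUE)
    paste0(out, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

to_numeric_entities(c("Buschwindröschen", "Weißdorn"))
# [1] "Buschwindr&#246;schen" "Wei&#223;dorn"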
I am trying to get a JSON response from an API:
test <- GET(url, add_headers(`api_key` = key))
content(test, 'parsed')
When I run content(test, 'parsed'), I get the following error:
# Error: lexical error: invalid string in json text. .Note: Final passage of the "fiscal cliff bill" on January 1
I think this is because of the double quotations. How can I either replace the double quotes or if this is not the problem, how can I fix this issue?
Thanks!
So I had run into a similar problem before, and I had intended to write a quick function that uses Jeroen's fix to try to repair the JSON. Since I intended to do it anyway, here's a quick hack attempt.
NB: repairing a structured format like this is speculative at best and most certainly prone to errors. The good news is that I tried to keep this specific enough so that it will not produce false results: it'll either fix what it knows it can, or fail. The "unit-testing" really needs to check other corner-cases. If you find something that this does not fix (and should) or that this breaks (gasp!), please comment!
fix_json_quotes <- function(s) {
  if (length(s) != 1) {
    warning("the argument has length > 1 and only the first element will be used")
    s <- s[[1]]
  }
  stopifnot(is.character(s))
  val <- jsonlite::validate(s)
  while (! val) {
    ind <- attr(val, "offset") - 1
    snew <- gsub("(.*)(['\"])([[:space:],]*)$", "\\1\\\\\\2\\3", substr(s, 1, ind))
    if (snew != substr(s, 1, ind)) {
      s <- paste0(snew, substr(s, ind + 1, nchar(s)))
    } else {
      break
    }
    val <- jsonlite::validate(s)
  }
  if (! val) {
    # still not validating
    stop("unable to fix quotes")
  }
  return(s)
}
Some sample data, unit-testing if you will (testthat is not required for use of the function):
library(testthat)
library(jsonlite)
lst <- list(a="final \"cliff bill\" on")
json <- as.character(toJSON(lst))
json
# [1] "{\"a\":[\"final \\\"cliff bill\\\" on\"]}"
Okay, there should be no change:
expect_equal(json, fix_json_quotes(json))
Some bad data:
# un-escape the double quotes
badlst <- "{\"a\":[\"final \"cliff bill\" on\"]}"
expect_error(jsonlite::fromJSON(badlst))
expect_equal(json, fix_json_quotes(badlst))
PS: this looks specifically for double-quotes, nothing more. However, I believe that there are related errors that this might also be able to fix. I "left room" for this, in the second group within the regex (([\"])); for example, if single-quotes could also cause a problem, then the group could be changed to be ([\"']). I don't know if it's useful or even necessary.
I'm trying to scrape data from transfermarkt using mainly the XML and httr packages.
page.doc <- content(GET("http://www.transfermarkt.es/george-corral/marktwertverlauf/spieler/103889"))
After downloading, there is a hidden array named 'series':
'series':[{'type':'line','name':'Valor de mercado','data':[{'y':600000,'verein':'CF América','age':21,'mw':'600 miles €','datum_mw':'02/12/2011','x':1322780400000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/3631.png?lm=1403472558)'}},{'y':850000,'verein':'Jaguares de Chiapas','age':21,'mw':'850 miles €','datum_mw':'02/06/2012','x':1338588000000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4774_1441956822.png?lm=1441956822)'}},{'y':1000000,'verein':'Jaguares de Chiapas','age':22,'mw':'1,00 mill. €','datum_mw':'03/12/2012','x':1354489200000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4774_1441956822.png?lm=1441956822)'}},{'y':1000000,'verein':'Jaguares de Chiapas','age':22,'mw':'1,00 mill. €','datum_mw':'29/05/2013','x':1369778400000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4774_1441956822.png?lm=1441956822)'}},{'y':1250000,'verein':'Querétaro FC','age':23,'mw':'1,25 mill. €','datum_mw':'27/12/2013','x':1388098800000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4961.png?lm=1409989898)'}},{'y':1500000,'verein':'Querétaro FC','age':24,'mw':'1,50 mill. €','datum_mw':'01/09/2014','x':1409522400000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4961.png?lm=1409989898)'}},{'y':1800000,'verein':'Querétaro FC','age':25,'mw':'1,80 mill. €','datum_mw':'01/10/2015','x':1443650400000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4961.png?lm=1409989898)'}}]}]
Is there a way to download directly? I want to scrape 600+ pages.
Until now, I have tried
page.doc.2 <- xpathSApply(page.doc, "//*/div[@class='eight columns']")
page.doc.2 <- xpathSApply(page.doc, "//*/div[@class='eight columns']", xmlAttrs)
No, there is no way to download just the JSON data: the JSON array you’re interested in is embedded inside the page’s source code, as part of a script.
You can then use conventional XPath or CSS selectors to find the script elements. However, finding and extracting just the JSON part is harder without a library that evaluates the JavaScript code. A better option would definitely be to use an official API, should one exist.
library(rvest) # Better suited for web scraping than httr & XML.
library(rjson)

doc = read_html('http://www.transfermarkt.es/george-corral/marktwertverlauf/spieler/103889')

script = doc %>%
    html_nodes('script') %>%
    html_text() %>%
    grep(pattern = "'series':", value = TRUE)

# Replace JavaScript quotes with JSON quotes
json_content = gsub("'", '"', gsub("^.*'series':", '', script))

# Truncate characters from the end until the result is parseable as valid JSON …
while (nchar(json_content) > 0) {
    json = try(fromJSON(json_content), silent = TRUE)
    if (! inherits(json, 'try-error'))
        break
    json_content = substr(json_content, 1, nchar(json_content) - 1)
}
However, there’s no guarantee that the above will always work: it is JavaScript after all, not JSON; the two are similar but not every valid JavaScript array is valid JSON.
It could be possible to evaluate the JavaScript fragment instead but that gets much more complicated. As a start, take a look at the V8 interface for R.
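A minimal sketch of that V8 route, assuming a hypothetical variable series_js that holds just the JavaScript array literal shown in the question rather than the whole script text:

library(V8)
ct <- v8()
ct$eval(paste0("var series = ", series_js))  # let the JS engine parse the literal
series <- ct$get("series")                   # converted back to an R object via jsonlite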
I want to retrieve the names of products from the website, so I wrote the code below, but the result includes extraneous characters such as \n\t\t\t. Can someone help me remove them?
code:
# retrieve names
reddoturl <- 'http://red-dot.de/pd/online-exhibition/?lang=en&c=163&a=0&y=2013&i=0&oes='
library(XML)
doc <- htmlParse(reddoturl)
# review data
reviews <- xpathSApply(doc, '//div[@class="work_contaienterner_headline"]', xmlValue)
results:
[1] "VZ-C6 / VZ-C3D\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\tDocument Camera\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t"
I worry a bit about removing all tabs but this would do it:
> reviews <- "VZ-C6 / VZ-C3D\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\tDocument Camera\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t"
> reviews <- gsub( "\\\t", "", reviews)
> reviews
[1] "VZ-C6 / VZ-C3D\n\nDocument Camera\n\n"
Read ?regex and understand that the extra backslashes are needed because both R and the regex engine use "\" as an escape character, so there are two levels of character parsing on the way to a pattern. That's not the case in the replacement argument, though, so you don't need doubled escapes there. So if you then wanted to replace those "\n\n"'s with just one "\n" you could use:
> reviews <- gsub( "\\\n\\\n", "\n", reviews)
> reviews
[1] "VZ-C6 / VZ-C3D\nDocument Camera\n"
The go-to functions for "find and replace" operations on strings in R are sub (to replace just the first instance) and gsub (to replace all instances). These functions seek a pattern in the string, expressed as a regular expression, and replace it with a fixed string of text.
For example:
s <- "VZ-C6 / VZ-C3D\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\tDocument Camera\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t"
gsub('\t|\n', '', s)
[1] "VZ-C6 / VZ-C3DDocument Camera"
The pipe operator (|) in the pattern above, \t|\n, ensures that either \t or \n is matched, and the second argument of '' says to replace matches with an empty string (i.e. nothing).
While s above contains just a single element, gsub and sub are vectorised and so will also work on an entire vector of arbitrary length.
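A quick illustration of that vectorised behaviour (the second element is made up):

reviews <- c("VZ-C6 / VZ-C3D\n\t\tDocument Camera\n", "Another Product\n\t\tLamp\n")
gsub('\t|\n', '', reviews)
# [1] "VZ-C6 / VZ-C3DDocument Camera" "Another ProductLamp"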
This question is about a generic mechanism for converting any collection of non-cyclical homogeneous or heterogeneous data structures into a dataframe. This can be particularly useful when dealing with the ingestion of many JSON documents or with a large JSON document that is an array of dictionaries.
There are several SO questions that deal with manipulating deeply nested JSON structures and turning them into dataframes using functionality such as plyr, lapply, etc. All the questions and answers I have found are about specific cases as opposed to offering a general approach for dealing with collections of complex JSON data structures.
In Python and Ruby I've been well-served by implementing a generic data structure flattening utility that uses the path to a leaf node in a data structure as the name of the value at that node in the flattened data structure. For example, the value my_data[['x']][[2]][['y']] would appear as result[['x.2.y']].
If one has a collection of these data structures that may not be entirely homogeneous the key to doing a successful flattening to a dataframe would be to discover the names of all possible dataframe columns, e.g., by taking the union of all keys/names of the values in the individually flattened data structures.
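For concreteness, a minimal sketch of that union-of-names idea (the flattened records here are made up):

flat1 <- c(x.1 = 1, x.2.y = 2)
flat2 <- c(x.1 = 3, z = 4)
all_cols <- union(names(flat1), names(flat2))              # "x.1" "x.2.y" "z"
rows <- lapply(list(flat1, flat2), function(r) r[all_cols])
df <- as.data.frame(do.call(rbind, rows))
names(df) <- all_cols                                      # keys missing from a record become NA
df
#   x.1 x.2.y  z
# 1   1     2 NA
# 2   3    NA  4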
This seems like a common pattern and so I'm wondering whether someone has already built this for R. If not, I'll build it but, given R's unique promise-based data structures, I'd appreciate advice on an implementation approach that minimizes heap thrashing.
Hi @Sim, I had cause to reflect on your problem yesterday. Define:
flatten <- function(x) {
  dumnames <- unlist(getnames(x, T))
  dumnames <- gsub("(*.)\\.1", "\\1", dumnames)
  repeat {
    x <- do.call(.Primitive("c"), x)
    if (!any(vapply(x, is.list, logical(1)))) {
      names(x) <- dumnames
      return(x)
    }
  }
}
getnames <- function(x, recursive) {
  nametree <- function(x, parent_name, depth) {
    if (length(x) == 0)
      return(character(0))
    x_names <- names(x)
    if (is.null(x_names)) {
      x_names <- seq_along(x)
      x_names <- paste(parent_name, x_names, sep = "")
    } else {
      x_names[x_names == ""] <- seq_along(x)[x_names == ""]
      x_names <- paste(parent_name, x_names, sep = "")
    }
    if (!is.list(x) || (!recursive && depth >= 1L))
      return(x_names)
    x_names <- paste(x_names, ".", sep = "")
    lapply(seq_len(length(x)), function(i) nametree(x[[i]],
      x_names[i], depth + 1L))
  }
  nametree(x, "", 0L)
}
(getnames is adapted from AnnotationDbi:::make.name.tree)
(flatten is adapted from discussion here How to flatten a list to a list without coercion?)
As a simple example:
my_data<-list(x=list(1,list(1,2,y='e'),3))
> my_data[['x']][[2]][['y']]
[1] "e"
> out<-flatten(my_data)
> out
$x.1
[1] 1
$x.2.1
[1] 1
$x.2.2
[1] 2
$x.2.y
[1] "e"
$x.3
[1] 3
> out[['x.2.y']]
[1] "e"
So the result is a flattened list with roughly the naming structure you suggest. Coercion is also avoided, which is a plus.
A more complicated example
library(RJSONIO)
library(RCurl)
json.data<-getURL("http://www.reddit.com/r/leagueoflegends/.json")
dumdata<-fromJSON(json.data)
out<-flatten(dumdata)
UPDATE
A naive way to remove the trailing .1:
my_data<-list(x=list(1,list(1,2,y='e'),3))
gsub("(*.)\\.1","\\1",unlist(getnames(my_data,T)))
> gsub("(*.)\\.1","\\1",unlist(getnames(my_data,T)))
[1] "x.1" "x.2.1" "x.2.2" "x.2.y" "x.3"
R has two packages for dealing with JSON input: rjson and RJSONIO. If I understand correctly what you mean by "collection of non-cyclical homogeneous or heterogeneous data structures", I think either of these packages will import that sort of structure as a list.
You can then flatten that list (into a vector) using the unlist function.
If the list is suitably structured (a non-nested list where each element is the same length) then as.data.frame provides an alternative to convert the list to a data frame.
An example:
(my_data <- list(x = list('1' = 1, '2' = list(y = 2))))
unlist(my_data)
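For reference, that unlist call returns a named vector whose names are built from the nested paths (here x.1 and x.2.y), and for a non-nested list of equal-length elements as.data.frame works directly; a small sketch of the latter:

as.data.frame(list(a = 1:2, b = c("p", "q")))
#   a b
# 1 1 p
# 2 2 q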
The jsonlite package is a fork of RJSONIO specifically designed to make conversion between JSON and data frames easier. You don't provide any example json data, but I think this might be what you are looking for. Have a look at this blog post or the vignette.
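As a small illustration (made-up inline JSON), jsonlite's fromJSON simplifies an array of records straight into a data frame by default:

library(jsonlite)
fromJSON('[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]')
#   id name
# 1  1    a
# 2  2    b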
Great answer with the flatten and getnames functions. Took a few minutes to figure out all the options needed to get from a vector of JSON strings to a data.frame, so I thought I'd record that here. Suppose jsonvec is a vector of JSON strings. The following builds a data.frame (data.table) where there is one row per string, and each column corresponds to a different possible leaf node of the JSON tree. Any string missing a particular leaf node is filled with NA.
library(data.table)
library(jsonlite)
parsed = lapply(jsonvec, fromJSON, simplifyVector=FALSE)
flattened = lapply(parsed, flatten) #using flatten from accepted answer
d = rbindlist(flattened, fill=TRUE)
I'm now a big fan of simply:
library(jsonlite)
library(tidyverse)
fromJSON("file_path.json") %>%
unlist() %>%
enframe()
And then potentially, depending on your data, piping that into
%>%
pivot_wider()
Once it's in a flat table shape, there are plenty of tools in the tidyverse and other R libraries for wrangling things around, e.g. dealing with columns with similar prefixes (which the above pipeline will produce, since the parent name of a nested JSON chunk gets prefixed to each child's name).
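As a minimal sketch, with a made-up inline JSON string standing in for file_path.json:

library(jsonlite)
library(tidyverse)

j <- '{"a": 1, "b": {"c": 2}}'
fromJSON(j) %>%
  unlist() %>%
  enframe() %>%
  pivot_wider()
# # A tibble: 1 x 2
#       a   b.c
#   <dbl> <dbl>
# 1     1     2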
I am trying to use \Sexpr{} to include values from my R objects in a LaTeX table. I am essentially trying to replicate the summary output of an lm object in R, because xtable's built-in methods xtable.lm and xtable.summary.lm don't seem to include the F-statistics, adjusted R-squared, etc. (all the material at the bottom of the summary printout of an lm object in the R console).

So I tried accomplishing this by building a matrix to replicate the xtable.summary.lm output, then constructing a data frame of the relevant extra information so I can refer to the values using \Sexpr{}. I did this by using add.to.row to append a \multicolumn{} command that merges all columns of the last row of the LaTeX table, and then passing all the information I need into that cell of the table.
The problem is that I get an "Undefined control sequence" for the \Sexpr{} expression in the \multicolumn{} expression. Are these two not compatible? If so, what am I doing wrong and if not does anyone know how to do what I am trying to do?
Thanks,
Here is the relevant part of my code:
<<Test, results=tex>>=
model1 <- lm(stndfnl ~ atndrte + frosh + soph)
# Build matrix to replicate xtable.summary.lm output
x <- summary(model1)
colnames <- c("Estimate", "Std. Error", "t value", "Pr(<|t|)")
rownames <- c("(Intercept)", attr(x$terms, "term.labels"))
fpval <- pf(x$fstatistic[1],x$fstatistic[2], x$fstatistic[3], lower.tail=FALSE)
mat1 <- matrix(coef(x), nrow=length(rownames), ncol=length(colnames), dimnames=list(rownames,colnames))
# Make a data frame for extra information to be called by \Sexpr in last row of table
residse <- x$sigma
degf <- x$df[2]
multr2 <- x$r.squared
adjr2 <- x$adj.r.squared
fstat <- x$fstatistic[1]
fstatdf1 <- x$fstatistic[2]
fstatdf2 <- x$fstatistic[3]
extradat <- data.frame(v1 = round(residse,4), v2 =degf, v3=round(multr2,4), v4=round(adjr2,4),v5=round(fstat,3), v6=fstatdf1, v7=fstatdf2, v8=round(fpval,6))
addtorow<- list()
addtorow$pos <-list()
addtorow$pos[[1]] <- dim(mat1)[1]
addtorow$command <-c('\\hline \\multicolumn{5}{l}{Residual standard error:\\Sexpr{extradat$v1}} \\\\ ')
print(xtable(mat1, caption="Summary Results for Regression in Equation \\eqref{model1} ", label="tab:model1"), add.to.row=addtorow, sanitize.text.function=NULL, caption.placement="top")
You don't need to have Sexpr in your R code; the R code can use the expressions directly. Sexpr is not a LaTeX command, even though it looks like one; it's an Sweave command, so it doesn't work to have it as output from R code.
Try
addtorow$command <- paste('\\hline \\multicolumn{5}{l}{Residual standard error:',
                          extradat$v1, '} \\\\ ')
Also, no need to completely recreate the matrix used by xtable, you can just build on the default output. Building on what you have above, something like:
mytab <- xtable(model1, caption="Summary Results", label="tab:model1")
addtorow$pos[[1]] <- dim(mytab)[1]
print(mytab, add.to.row=addtorow, sanitize.text.function=NULL,
      caption.placement="top")
See http://people.su.se/~lundh/reproduce/sweaveintro.pdf for an example which you might be able to use as is.