Eval parse for a JSON - json

I am trying to automate JSON parsing in R (I had to remove the "https://" from the URLs because I don't have enough reputation points):
library(Quandl)
library(jsonlite)
tmp <- fromJSON("www.quandl.com/api/v3/datasets.json?database_code=WIKI&page=2", flatten = TRUE)
for various numbers in page=X. The above snippet executes properly. To automate it I am trying to use eval(parse()), but I am doing something wrong. So I have the following:
text1 <- 'fromJSON("www.quandl.com/api/v3/datasets.json?database_code=WIKI&page='
text2 <- '",flatten = TRUE)'
and to verify that I am creating the string properly:
> text1
[1] "fromJSON(\www.quandl.com/api/v3/datasets.json?database_code=WIKI&page="
> text2
[1] "\",flatten = TRUE)"
> cat(text1,n,text2,sep="")
fromJSON("www.quandl.com/api/v3/datasets.json?database_code=WIKI&page=2",flatten = TRUE)
BUT when I try to execute:
koko <- eval(parse(text = cat(text1,n,text2,sep="")))
where n <- 2 or any other integer, the console freezes and eventually shows the following error message:
Error in parse(text = cat(text1, n, text2, sep = "")) :
<stdin>:1:4: unexpected '{'
1: D_{
^
What am I doing wrong here?

Have a read of the difference between paste and cat.
cat just prints to the console; it returns NULL (invisibly), so it cannot be used to build a string. To create a string you should use paste or paste0.
For example, consider
concat <- cat(text1, n, text2)
p <- paste0(text1, n, text2)
Even when running concat <- cat(text1, n, text2), the output is printed to the console and concat ends up NULL. That is also why your console appeared to freeze: since cat() returns NULL, parse(text = NULL) falls back to reading from stdin and sits waiting for input.
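A quick check of the return values makes the difference concrete (a small sketch, reusing text1, text2 and n from above):
concat <- cat(text1, n, text2)   # prints to the console, returns NULL invisibly
is.null(concat)
# [1] TRUE
p <- paste0(text1, n, text2)
is.character(p)
# [1] TRUE  (paste0 actually returns the assembled string)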
The solution is to use paste0 to create the string expression
text1 <- 'fromJSON("http://www.quandl.com/api/v3/datasets.json?database_code=WIKI&page='
text2 <- '",flatten = TRUE)'
n <- 2
koko <- eval(parse(text = (paste0(text1, n, text2))))
Also, you don't need eval at all; you can build the URL with paste0 and pass it to fromJSON directly:
text1 <- 'http://www.quandl.com/api/v3/datasets.json?database_code=WIKI&page='
n <- 2
koko <- fromJSON(paste0(text1, n), flatten=TRUE)
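And to automate over several pages, a minimal sketch along the same lines (assuming you want pages 1 to 5; adjust pages as needed):
library(jsonlite)
base_url <- "http://www.quandl.com/api/v3/datasets.json?database_code=WIKI&page="
pages <- 1:5
# fetch each page and collect the results in a named list
results <- lapply(pages, function(n) fromJSON(paste0(base_url, n), flatten = TRUE))
names(results) <- paste0("page_", pages)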

Related

calling a function in R script

I wrote an R script in which I defined a function and then called it. Here is the whole script:
PrepData = function(infile){
  data <- read.table(infile, header=TRUE, as.is = TRUE, sep = ",")
  data = data[, 2:ncol(data)]
  merged.data = data
  colnames(merged.data[1]) < "CodeCount"
  rownames(merged.data) <- merged.data$Name
  x <- list(counts = merged.data, raw.counts = merged.data)
  return(x)
}
data <- PrepData(myfile.csv)
data
but when I run it using the following command:
Rscript myscript.r
it gives this error:
Error in read.table(infile, header = TRUE, as.is = TRUE, sep = ",") :
object 'myfile.csv' not found
Calls: PrepData -> read.table
Execution halted
Do you know how to fix it?
Try to change
data <- PrepData(myfile.csv)
To
data <- PrepData("myfile.csv")
You need quotation marks around the filename when you call read.table. Without them, R looks for an object named myfile.csv rather than a file, which is exactly what object 'myfile.csv' not found means.
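As a side note, this line inside PrepData also has a bug: colnames(merged.data[1]) < "CodeCount" uses < (a comparison) instead of <- (assignment) and subsets in the wrong place, so it silently does nothing. To actually rename the first column it should read:
colnames(merged.data)[1] <- "CodeCount"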

R - Issue with the DOM of the danish parliament (webscraping)

I've been working on a webscraping project for the political science department at my university.
The Danish parliament is very transparent about their democratic process, and they upload all the legislative documents to their website. I've been crawling over all pages starting in 2008. Right now I'm parsing the information into a dataframe, and I'm having an issue that I have not been able to resolve so far.
If we look at the DOM, we can see that most of the objects are named div.tingdok-normal. The number of objects varies between 16 and 19. To parse the information correctly for my dataframe, I tried to grep out the necessary parts according to patterns. However, a pattern sometimes matches more than once, and I don't know how to tell R that I only want the first match.
For the sake of an example, here is some code:
library(RCurl)
library(rvest)
final.url <- "https://www.ft.dk/samling/20161/lovforslag/l154/index.htm"
to.save <- getURL(final.url)
p <- read_html(to.save)
normal <- p %>% html_nodes("div.tingdok-normal > span") %>% html_text(trim = TRUE)
tomatch <- c("Forkastet regeringsforslag", "Forkastet privat forslag", "Vedtaget regeringsforslag", "Vedtaget privat forslag")
type <- unique(grep(paste(tomatch, collapse = "|"), normal, value = TRUE))
Maybe you can help me with that.
My understanding is that you want to extract the text of the webpage, because the "tingdok-normal" nodes hold the text. I was able to get the text of the webpage with the following code, which also locates the position of the first regex hit for each pattern to match.
library(pagedown)
library(pdftools)
library(stringr)
pagedown::chrome_print("https://www.ft.dk/samling/20161/lovforslag/l154/index.htm",
"C:/.../danish.pdf")
text <- pdftools::pdf_text("C:/.../danish.pdf")
tomatch <- c("(A|a)ftalen", "(O|o)pholdskravet")
nb_Tomatch <- length(tomatch)
list_Position <- list()
list_Text <- list()
for(i in 1:nb_Tomatch)
{
  # locate the first hit of the regex
  # (to locate all hits, use stringr::str_locate_all)
  list_Position[[i]] <- stringr::str_locate(text, pattern = tomatch[i])
  list_Text[[i]] <- stringr::str_sub(string = text,
                                     start = list_Position[[i]][1, 1],
                                     end = list_Position[[i]][1, 2])
}
Here is another approach:
library(RDCOMClient)
library(stringr)
library(rvest)
url <- "https://www.ft.dk/samling/20161/lovforslag/l154/index.htm"
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)
Sys.sleep(5)
doc <- IEApp$Document()
html_Content <- doc$documentElement()$innerText()
tomatch <- c("(A|a)ftalen", "(O|o)pholdskravet")
nb_Tomatch <- length(tomatch)
list_Position <- list()
list_Text <- list()
for(i in 1:nb_Tomatch)
{
  # locate the first hit of the regex in the page text
  # (to locate all hits, use stringr::str_locate_all)
  list_Position[[i]] <- stringr::str_locate(html_Content, pattern = tomatch[i])
  list_Text[[i]] <- stringr::str_sub(string = html_Content,
                                     start = list_Position[[i]][1, 1],
                                     end = list_Position[[i]][1, 2])
}
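If you only need the first match, plain base R is also enough, without Selenium or PDF conversion; a small sketch, reusing tomatch and the normal vector from the question:
# grep() returns all matching elements; take the first one
first_hit <- grep(paste(tomatch, collapse = "|"), normal, value = TRUE)[1]
# regexpr()/regmatches() return only the first match within each string
hits <- regmatches(normal, regexpr(paste(tomatch, collapse = "|"), normal))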

How to create html table including images with R (via relative path or inserted - base64)?

I found a solution to this in the past and I am still using it, but I never looked for anything else. I would like to share it and get some feedback from the community on whether a better solution exists.
My next idea is to implement a selection tool, a bit like editableCell. I could actually do it, but the main reason to use it would be to copy-paste the pictures somewhere else as they are positioned. Unfortunately, copy-paste does not carry the pictures along.
I have a list of JPEGs with specific names:
010003000Color3_0.jpg
010003000Color3_1.jpg
010003000Color3_2.jpg
I need to extract the information from those names and create a grid of the pictures depending on the information they contain.
E.g. here the names follow the pattern <ID><What>_<Field>.jpg: 010003000 is the ID, Color3 the "what", and the digit after the underscore the field.
Here is the complete code; I'll let you go through it. The comments partly explain it.
# Setup -------------------------------------------------------------------
library(data.table)
library(dplyr)
library(tableHTML)
library(knitr)
wd <- "C:/Users/gravier/Downloads/Example_Yann_Html/Jpeg"
# Function ----------------------------------------------------------------
addimgbalise <- function(vect, pisize = 200) {
  # wrap the image path in an HTML <img> tag with the given pixel width
  vect <- paste0("<img src='", vect, "' width=", pisize, "/>")
  return(vect)
}
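For example, the tag it produces looks like this:
addimgbalise("010003000Color3_0.jpg")
# [1] "<img src='010003000Color3_0.jpg' width=200/>"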
write.html.link <- function(data,
                            filename,
                            caption = filename,
                            wdfunction,
                            color.bg = "#000000",
                            color.line = "#b4bac4",
                            color.text = "#b4bac4",
                            font.size = "8px",
                            font.family = "Arial",
                            text.align = "center") {
  # write an HTML table whose cells contain clickable local links
  tabhtml <- tableHTML(data,
                       rownames = FALSE,
                       caption = caption,
                       theme = "default") %>%
    add_css_table(css = list('border', color.line)) %>%
    add_css_table(css = list('text-align', text.align)) %>%
    add_css_table(css = list('font-family', font.family)) %>%
    add_css_table(css = list('font-size', font.size)) %>%
    add_css_table(css = list('color', color.text))
  write_tableHTML(tabhtml, file = paste0(wdfunction, "/", filename, ".html"))
  # write_tableHTML escapes the '<' and '>' of the addimgbalise tags as HTML
  # entities, so re-read the file and turn them back into real characters
  temp <- suppressWarnings(readLines(paste0(wdfunction, "/", filename, ".html")))
  temp <- gsub("&gt;", ">", temp, fixed = TRUE)
  temp <- gsub("&lt;", "<", temp, fixed = TRUE)
  temp <- c(paste0("<body bgcolor='", color.bg, "'>"), temp)
  writeLines(temp, paste0(wdfunction, "/", filename, ".html"))
}
# Parameters (these would normally be function arguments, but I detail the process here) --------
patternlist <- ".jpg"
regexjpeg <- "([0-9]*)(.*)_(.*)" # the regex describing the file names
regexposID <- "\\1" # capture groups for the variables to extract from the name
regexposWhat <- "\\2"
regexposField <- "\\3"
regexposZ <- "\\3"
formulaPV <- "row+Field~col+What" # the dcast formula controlling how the pictures are arranged
pixelimg <- 150 # display size in the HTML; the pictures keep their real resolution so you can zoom
base64 <- FALSE
listimg <- data.table(path = list.files(wd, full.names = TRUE))
listimg[, ID := gsub(paste0(regexjpeg, patternlist), regexposID, basename(path))]
listimg[, What := gsub(paste0(regexjpeg, patternlist), regexposWhat, basename(path))]
listimg[, Field := as.numeric(gsub(paste0(regexjpeg, patternlist), regexposField, basename(path)))]
listimg[, Z := as.numeric(gsub(paste0(regexjpeg, patternlist), regexposZ, basename(path)))]
listimg[, row := LETTERS[as.numeric(substr(ID, 1, 3))]]
listimg[, col := as.numeric(substr(ID, 4, 6))]
if (base64) {
  for (i in 1:nrow(listimg)) {
    listimg[i, code := paste0(addimgbalise(image_uri(path), pixelimg), "\n", row, col, "-", Field, "-", What)]
  }
} else {
  listimg[, code := paste0(addimgbalise(path, pixelimg), "\n", row, col, "-", Field, "-", What)]
}
listimg3 <- dcast(listimg, as.formula(formulaPV), value.var = "code")
listimg3 <- data.frame(listimg3)
listimg3[nrow(listimg3)+1,] <- colnames(listimg3)
write.html.link(data = listimg3, filename = "Picture_grid", caption = paste0("Picture_grid", " - ", formulaPV), wdfunction = dirname(wd))
# Some other example ------------------------------------------------------
formulaPV <- "row+Field+What~col" # then the dcast formula regarding how to arrange the pictures
listimg3 <- dcast(listimg, as.formula(formulaPV), value.var = "code")
listimg3 <- data.frame(listimg3)
listimg3[nrow(listimg3)+1,] <- colnames(listimg3)
write.html.link(data = listimg3, filename = "Picture_grid_other_formula", caption = paste0("Picture_grid", " - ", formulaPV), wdfunction = dirname(wd))
# base64
for(i in 1:nrow(listimg) ) {
listimg[i, code := paste0(addimgbalise(image_uri(path), pixelimg), "\n", row, col, "-", Field, "-", What)]
}
listimg3 <- dcast(listimg, as.formula(formulaPV), value.var = "code")
listimg3 <- data.frame(listimg3)
listimg3[nrow(listimg3)+1,] <- colnames(listimg3)
write.html.link(data = listimg3, filename = "Picture_grid_base64", caption = paste0("Picture_grid", " - ", formulaPV), wdfunction = dirname(wd))
The final result is the picture grid rendered as an HTML table.

Avoid getting "glued" words with R webscraping

When I use either of the two following blocks of code, I get "glued" words, by which I mean words that should be separated by a space but are not, and this is a problem. In the original HTML they seem to be separated by a <br>, and I have not been able to handle this. The two blocks do the same thing in different ways.
library(XML)
library(RCurl)
# Block 1---------
url <- "https://www.letras.mus.br/red-hot-chili-peppers/32739/"
u <- readLines(url)
h <- htmlTreeParse(file=u,
asText=TRUE,
useInternalNodes = TRUE,
encoding = "utf-8")
song <- getNodeSet(doc=h, path="//article", fun=xmlValue)
# Block 2---------
u <- "https://www.letras.mus.br/red-hot-chili-peppers/32739/"
h <- htmlParse(getURL(u))
song <- xpathSApply(h, path = "//article", fun = xmlValue)
Which returns something like:
[1] "Sometimes I feelLike I don't have a partnerSometimes I feelLike my only friendIs the city I live inThe city of angelsLonely as I amTogether we cryI drive on her streets'Cause she's my companionI walk through her hills'Cause she knows who I amShe sees my good deedsAnd she kisses me windyI never worryNow that is a lieI don't ever wanna feelLike I did that dayBut take me to the place I loveTake me all the wayIt's hard to believeThat there's nobody out thereIt's hard to believeThat I'm all aloneAt...
I was able to retrieve the words with the following code:
library(RSelenium)
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate("https://www.letras.mus.br/red-hot-chili-peppers/32739/")
remDr$screenshot(display = TRUE, useViewer = TRUE)
page_Content <- remDr$getPageSource()[[1]]
list_Text_Song <- list()
for(i in 1:30)
{
  print(i)
  web_Obj <- tryCatch(remDr$findElement("xpath", paste0("//*[@id='js-lyric-cnt']/article/div[2]/div[2]/p[", i, "]")), error = function(e) NA)
  list_Text_Song[[i]] <- tryCatch(web_Obj$getElementText(), error = function(e) NA)
}
list_Text_Song <- unlist(list_Text_Song)
list_Text_Song <- list_Text_Song[!is.na(list_Text_Song)]
The words are not glued with this approach.
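For completeness, a lighter-weight sketch that avoids Selenium, assuming the glued spots are <br> tags: replace them with newlines in the raw HTML before extracting the text.
library(rvest)
u <- "https://www.letras.mus.br/red-hot-chili-peppers/32739/"
raw <- paste(readLines(u, warn = FALSE), collapse = "\n")
raw <- gsub("<br/?>", "\n", raw)  # turn the HTML line breaks into real newlines
h <- read_html(raw)
song <- h %>% html_nodes("article") %>% html_text(trim = TRUE)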

How to read nested JSON structure?

I have some JSON that looks like this:
"total_rows":141,"offset":0,"rows":[
{"id":"1","key":"a","value":{"SP$Sale_Price":"240000","CONTRACTDATE$Contract_Date":"2006-10-26T05:00:00"}},
{"id":"2","key":"b","value":{"SP$Sale_Price":"2000000","CONTRACTDATE$Contract_Date":"2006-08-22T05:00:00"}},
{"id":"3","key":"c","value":{"SP$Sale_Price":"780000","CONTRACTDATE$Contract_Date":"2007-01-18T06:00:00"}},
...
In R, what would be the easiest way to produce a scatter-plot of SP$Sale_Price versus CONTRACTDATE$Contract_Date?
I got this far:
install.packages("rjson")
library("rjson")
json_file <- "http://localhost:5984/testdb/_design/sold/_view/sold?limit=100"
json_data <- fromJSON(file=json_file)
install.packages("plyr")
library(plyr)
asFrame <- do.call("rbind.fill", lapply(json_data, as.data.frame))
but now I'm stuck...
> plot(CONTRACTDATE$Contract_Date, SP$Sale_Price)
Error in plot(CONTRACTDATE$Contract_Date, SP$Sale_Price) :
object 'CONTRACTDATE' not found
How to make this work?
Suppose you have the following JSON:
txt <- '{"total_rows":141,"offset":0,"rows":[
{"id":"1","key":"a","value":{"SP$Sale_Price":"240000","CONTRACTDATE$Contract_Date":"2006-10-26T05:00:00"}},
{"id":"2","key":"b","value":{"SP$Sale_Price":"2000000","CONTRACTDATE$Contract_Date":"2006-08-22T05:00:00"}},
{"id":"3","key":"c","value":{"SP$Sale_Price":"780000","CONTRACTDATE$Contract_Date":"2007-01-18T06:00:00"}}]}'
Then you can read it as follows with the jsonlite package:
library(jsonlite)
json_data <- fromJSON(txt, flatten = TRUE)
# get the needed dataframe
dat <- json_data$rows
# set convenient names for the columns
# (optional, it just gives you nicer column names)
names(dat) <- c("id", "key", "sale_price", "contract_date")
# convert 'sale_price' to numeric and 'contract_date' to a datetime
dat$sale_price <- as.numeric(dat$sale_price)
dat$contract_date <- strptime(dat$contract_date, format = "%Y-%m-%dT%H:%M:%S", tz = "GMT")
Now you can plot:
plot(dat$contract_date, dat$sale_price)
This gives a scatterplot of sale price against contract date.
If you choose not to flatten the JSON, you can do:
json_data <- fromJSON(txt)
dat <- json_data$rows$value
sp <- strtoi(dat$`SP$Sale_Price`)
cd <- strptime(dat$`CONTRACTDATE$Contract_Date`, format="%Y-%m-%dT%H:%M:%S", tz="GMT")
plot(cd,sp)
Which gives the same plot.
I found a way that doesn't discard the field names:
install.packages("jsonlite")
install.packages("curl")
json <- fromJSON(json_file)
r <- json$rows
At this point r looks like this:
> class(r)
[1] "data.frame"
> colnames(r)
[1] "id" "key" "value"
After some more Googling and trial-and-error I landed on this:
f <- r$value
sp <- strtoi(f[["SP$Sale_Price"]])
cd <- strptime(f[["CONTRACTDATE$Contract_Date"]], format="%Y-%m-%dT%H:%M:%S", tz="GMT")
plot(cd,sp)
And the result on my full data-set...
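One caveat: strtoi() parses into R's 32-bit integers and returns NA for anything above 2^31 - 1, so as.numeric() is the safer conversion if prices can get very large:
sp <- as.numeric(f[["SP$Sale_Price"]])  # no integer-range limit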