Loading json encoded log files into R for analysis

I have a log file with each line a json-encoded entry:
{"requestId":"5550d","partnerId":false,"ip":"170.158.3.1", ... }
I tried reading each line, decoding the JSON, and then appending to the data frame:
loadLogs <- function(fileName) {
  conn <- file(fileName, "r", blocking = FALSE)
  linn <- readLines(conn)
  long <- length(linn)
  df = data.frame(requestId = character(0),
                  partnerId = character(0),
                  ip = character(0))
  for (i in 1:long) {
    jsonRow <- fromJSON(linn[i])
    df <- rbind(df, data.frame(requestId = jsonRow$requestId,
                               partnerId = as.character(jsonRow$partnerId),
                               ip = jsonRow$ip))
  }
  close(conn)
  return(df)
}
The above code is extremely slow for large files, though. Is there any way to speed it up? A few options I can think of at the moment:
(1) pre-allocate the data frame, since appending copies the entire data frame on each iteration
(2) use an apply function for the JSON decoding
(3) ???
How would I do (1) and (2) in R? I'm new.
Thanks for looking at my question.
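A rough sketch of what (1) and (2) could look like, assuming jsonlite's fromJSON and that every line carries the three fields shown above:
library(jsonlite)

loadLogs <- function(fileName) {
  linn <- readLines(fileName)
  # (2) decode every line up front with lapply instead of inside a growing loop
  rows <- lapply(linn, fromJSON)
  # (1) build each column in one pass rather than rbind-ing row by row,
  #     which avoids copying the whole data frame on every append
  data.frame(requestId = vapply(rows, function(r) r$requestId, character(1)),
             partnerId = vapply(rows, function(r) as.character(r$partnerId), character(1)),
             ip        = vapply(rows, function(r) r$ip, character(1)),
             stringsAsFactors = FALSE)
}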

Related

Shiny - assigning argument to function

I am trying to create a Shiny app that applies a self-made function to an uploaded dataset and then lets the user download the modified results. Here is my code:
library(shiny)
library(tidyverse)
namkurz <- function(data, a_spalte) {
  kuerzel <- vector(length = length(data$a_spalte))
  for (i in 1:length(data$a_spalte)) {
    spez = data$Art[i]
    s = unlist(strsplit(spez, " ", fixed = TRUE))
    s = substr(s, 1, 2)
    s = paste(s, collapse = ' ')
    kuerzel[[i]] = s
  }
  data <- data %>%
    mutate(kurz = kuerzel)
}
ui <- fluidPage(
  fileInput('upload', 'Deine Kartierungsdaten'),
  textInput('art', 'Wie heißt die Spalte mit Artnamen?'),
  downloadButton('analyse', 'Artenkürzel hinzufügen')
)

server <- function(input, output, session) {
  data <- reactive({
    req(input$upload)
    ext <- tools::file_ext(input$upload$name)
    switch(ext,
           csv = vroom::vroom(input$upload$datapath, delim = ";"),
           validate("Invalid file; Please upload a .csv file")
    )
  })
  art <- reactive(input$art)
  output$analyse <- downloadHandler(
    filename = function() {
      paste0('mit_kuerzel', ".csv")
    },
    content = function(file) {
      ergebnis <- reactive(namkurz(data(), art()))
      vroom::vroom_write(ergebnis(), file)
    }
  )
}

shinyApp(ui, server)
When trying to save the output I get a 'Warning: Unknown or uninitialised column:' message. I think my problem is in the assignment of the argument 'art' to the 'ergebnis' object, but I can't find a way to fix it.
I recommend a few things:
(Required) In your function, a_spalte is a character vector and not the literal name of a column in the frame, so you need to use [[ instead of $; see The difference between bracket [ ] and double bracket [[ ]] for accessing the elements of a list or dataframe and Dynamically select data frame columns using $ and a character value.
Change all references of data$a_spalte to data[[a_spalte]].
namkurz <- function(data, a_spalte) {
  kuerzel <- vector(length = length(data[[a_spalte]]))
  for (i in 1:length(data[[a_spalte]])) {
    spez = data$Art[i]
    s = unlist(strsplit(spez, " ", fixed = TRUE))
    s = substr(s, 1, 2)
    s = paste(s, collapse = ' ')
    kuerzel[[i]] = s
  }
  data <- data %>%
    mutate(kurz = kuerzel)
}
Your function is also a bit inefficient in doing things row-wise; we can vectorize that operation:
namkurz <- function(data, a_spalte) {
  spez <- strsplit(data$Art, " ", fixed = TRUE)
  data$kurz <- sapply(spez, function(z) paste(substr(z, 1, 2), collapse = " "))
  data
}
(Optional) The content= portion of downloadHandler is already reactive; you do not need to wrap namkurz in reactive(). Because of this, you also don't need to treat ergebnis as reactive.
output$analyse <- downloadHandler(
  filename = ...,
  content = function(file) {
    ergebnis <- namkurz(data(), art())
    vroom::vroom_write(ergebnis, file)
  }
)
(Optional) Your output filename is fixed, so two things here: if it's always going to be "mit_kuerzel.csv", then there's no need for paste0; just use filename = function() "mit_kuerzel.csv".
However, if you are intending to return a file named something based on the original input filename, one could do something like:
filename = function() {
  paste0(tools::file_path_sans_ext(basename(input$upload$name)),
         "_mit_kuerzel.",
         tools::file_ext(input$upload$name))
},
to add _mit_kuerzel to the base portion of the uploaded filename. Note that the file in the content= section never actually has this name; the ..._mit_kuerzel.csv name is only offered to the downloading browser as a suggested filename, that is all.
(Optional) You are using a .csv file extension in the downloadHandler, but the default for vroom::vroom_write is to use delim = "\t", which is not a CSV. I suggest either adding delim = ";" (or similar), or changing the returned filename extension to .tsv instead.
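A minimal sketch of that delimiter fix inside content=, assuming you want semicolon-separated output to match the semicolon-delimited upload:
content = function(file) {
  ergebnis <- namkurz(data(), art())
  # write a true semicolon-separated file so the .csv extension is accurate
  vroom::vroom_write(ergebnis, file, delim = ";")
}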

calling a function in R script

I wrote an R script in which I defined a function and then called it. Here is the whole script:
PrepData = function(infile) {
  data <- read.table(infile, header = TRUE, as.is = TRUE, sep = ",")
  data = data[, 2:ncol(data)]
  merged.data = data
  colnames(merged.data[1]) < "CodeCount"
  rownames(merged.data) <- merged.data$Name
  x <- list(counts = merged.data, raw.counts = merged.data)
  return(x)
}
data <- PrepData(myfile.csv)
data
but when I run it using the following command:
Rscript myscript.r
it gives this error:
Error in read.table(infile, header = TRUE, as.is = TRUE, sep = ",") :
object 'myfile.csv' not found
Calls: PrepData -> read.table
Execution halted
Do you know how to fix it?
Try changing
data <- PrepData(myfile.csv)
To
data <- PrepData("myfile.csv")
You need quotation marks around the filename when you pass it to read.table; without quotes, R treats myfile.csv as the name of an object and cannot find it, which is exactly what the error message says.
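Equivalently, you can store the filename in a quoted character variable first; a small sketch:
infile <- "myfile.csv"    # a quoted character string, not a bare object name
data <- PrepData(infile)  # the variable holds the filename, so read.table receives a string
data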

save html from data.frames to pdf using wkhtmltopdf or Markdown

I have a df with a column htmltext containing HTML text that I would like to print (as a batch if possible) as single PDFs, with doc_id as the filename.
Can I do that directly within R?
I thought about something like
> system("wkhtmltopdf --javascript-delay 1 in.html out.pdf")
How can I implement that in R?
Or is there another easy way to do so, using Markdown for example?
# df
doc_id <- c("doc1", "doc2", "doc3")
htmltext <- c("<b>good morning</b>", "<b>This text is bold</b>", "<b>good evening</b>")
df <- data.frame(doc_id, htmltext, stringsAsFactors = FALSE)

# save htmltext as single pdfs with doc_id as filename
filenames = df$doc_id
...?
See if one of these is acceptable:
library(rmarkdown)
library(decapitated) # devtools::install_github("hrbrmstr/decapitated") # requires Chrome
data.frame(
  doc_id = c("doc1", "doc2", "doc3"),
  htmltext = c("<b>good morning</b>", "<b>This text is bold</b>", "<b>good evening</b>"),
  stringsAsFactors = FALSE
) -> xdf

# hackish pandoc way
for (i in 1:nrow(xdf)) {
  message(sprintf("Processing %s", xdf$doc_id[i]))
  tf <- tempfile(fileext = ".html")
  writeLines(xdf$htmltext[i], tf)
  pandoc_convert(
    input = tf,
    to = "latex",
    output = sprintf("%s.pdf", xdf$doc_id[i]),
    wd = getwd()
  )
  unlink(tf)
}

# using headless chrome
for (i in 1:nrow(xdf)) {
  message(sprintf("Processing %s", xdf$doc_id[i]))
  tf <- tempfile(fileext = ".html")
  writeLines(xdf$htmltext[i], tf)
  chrome_dump_pdf(sprintf("file://%s", tf), path = sprintf("%s.pdf", xdf$doc_id[i]))
  unlink(tf)
}
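If you specifically want the wkhtmltopdf route from the question, a minimal sketch along the same lines (assuming the wkhtmltopdf binary is installed and on your PATH):
# shell out to wkhtmltopdf, naming each PDF after doc_id
for (i in 1:nrow(xdf)) {
  tf <- tempfile(fileext = ".html")
  writeLines(xdf$htmltext[i], tf)
  system(sprintf("wkhtmltopdf --javascript-delay 1 %s %s.pdf", tf, xdf$doc_id[i]))
  unlink(tf)
}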

For loop - doesn't work properly

I'm trying to use the Indeed API to search for specific jobs, and I've run into a problem where the for loop doesn't go through all of the iterations.
Here is the example of code that I used:
original_url_1 <- "http://api.indeed.com/ads/apisearch?publisher=750330686195873&format=json&q="
original_url_2 <- "&l=Canada&sort=date&radius=10&st=&jt=&start=0&limit=25&fromage=3&filter=&latlong=1&co=ca&chnl=&userip=69.46.99.196&useragent=Mozilla/%2F4.0%28Firefox%29&v=2"
keywords <- c("data+scientist", "data+analyst")

for (i in keywords) {
  url <- paste0(original_url_1, i, original_url_2)
  x <- as.data.frame(jsonlite::fromJSON(httr::content(httr::GET(url),
                                                      as = "text", encoding = "UTF-8")))
  data <- rbind(data, x)
}
The URL leads to a JSON file, and adding one of the keywords to the URL changes the JSON returned. So I'm trying to repeat this for all keywords and store the results in a data frame. However, when I try to use more keywords I only get results for the first few of them.
original_url_1 <- "http://api.indeed.com/ads/apisearch?publisher=750330686195873&format=json&q="
original_url_2 <- "&l=Canada&sort=date&radius=10&st=&jt=&start=0&limit=25&fromage=3&filter=&latlong=1&co=ca&chnl=&userip=69.46.99.196&useragent=Mozilla/%2F4.0%28Firefox%29&v=2"
keywords <- c("data_scientist", "data+analyst")
data <- data.table(NULL)  # initialization of object

for (i in keywords) {
  url <- paste0(original_url_1, i, original_url_2)
  x <- as.data.frame(jsonlite::fromJSON(httr::content(httr::GET(url), as = "text", encoding = "UTF-8")))
  data <- rbind(data, x)
}
> dim(data)
[1] 39 31
Here is the correct code:
original_url_1 <- "http://api.indeed.com/ads/apisearch?publisher=750330686195873&format=json&q="
original_url_2 <- "&l=Canada&sort=date&radius=10&st=&jt=&start=0&limit=25&fromage=3&filter=&latlong=1&co=ca&chnl=&userip=69.46.99.196&useragent=Mozilla/%2F4.0%28Firefox%29&v=2"
keywords <- c("data+scientist", "data+analyst")
data <- data.frame()

for (i in keywords) {
  # tryCatch lets the loop continue when the request or parse fails for one keyword;
  # the empty error handler simply skips that keyword
  tryCatch({
    url <- paste0(original_url_1, i, original_url_2)
    x <- as.data.frame(jsonlite::fromJSON(httr::content(httr::GET(url),
                                                        as = "text", encoding = "UTF-8")))
    data <- rbind(data, x)
  }, error = function(t) {})
}
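One caveat: the empty error = function(t){} handler drops failing keywords silently. A small variant (same httr/jsonlite calls, just a different handler) that reports which keyword failed:
for (i in keywords) {
  tryCatch({
    url <- paste0(original_url_1, i, original_url_2)
    x <- as.data.frame(jsonlite::fromJSON(httr::content(httr::GET(url),
                                                        as = "text", encoding = "UTF-8")))
    data <- rbind(data, x)
  }, error = function(e) {
    # log the keyword and the error message instead of discarding them
    message(sprintf("Keyword '%s' failed: %s", i, conditionMessage(e)))
  })
}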

is it possible to process file reading and parsing in R

There are a bunch of files in a directory, each containing JSON-formatted entries on every line. The size of the files varies from 5 KB to 200 MB. I have this code to go through each file, parse the data I'm looking for in the JSON, and finally form a data frame. This script takes a very long time to finish; in fact it never finishes.
Is there any way to speed it up so that I can read the files faster?
Code:
library(jsonlite)
library(data.table)

setwd("C:/Files/")
# data <- lapply(readLines("test.txt"), fromJSON)

df <- data.frame(Timestamp = factor(), Source = factor(), Host = factor(), Status = factor())
filenames <- list.files("Json_files", pattern = "*.txt", full.names = TRUE)

for (i in filenames) {
  print(i)
  data <- lapply(readLines(i), fromJSON)
  myDf <- do.call("rbind", lapply(data, function(d) {
    data.frame(TimeStamp = d$payloadData$timestamp,
               Source = d$payloadData$source,
               Host = d$payloadData$host,
               Status = d$payloadData$status)
  }))
  df <- rbind(df, myDf)
}
This is a sample entry but there are thousands of entries like this in the file:
{"senderDateTimeStamp":"2016/04/08 10:53:18","senderHost":null,"senderAppcode":"app","senderUsecase":"appinternalstats_prod","destinationTopic":"app_appinternalstats_realtimedata_topic","correlatedRecord":false,"needCorrelationCacheCleanup":false,"needCorrelation":false,"correlationAttributes":null,"correlationRecordCount":0,"correlateTimeWindowInMills":0,"lastCorrelationRecord":false,"realtimeESStorage":true,"receiverDateTimeStamp":1460127623591,"payloadData":{"timestamp":"2016-04-08T10:53:18.169","status":"get","source":"STREAM","fund":"JVV","client":"","region":"","evetid":"","osareqid":"","basis":"","pricingdate":"","content":"","msgname":"","recipient":"","objid":"","idlreqno":"","host":"WEB01","servermember":"test"},"payloadDataText":"","key":"app:appinternalstats_prod","destinationTopicName":"app_appinternalstats_realtimedata_topic","hdfsPath":"app/appinternalstats_prod","esindex":"app","estype":"appinternalstats_prod","useCase":"appinternalstats_prod","appCode":"app"}
{"senderDateTimeStamp":"2016/04/08 10:54:18","senderHost":null,"senderAppcode":"app","senderUsecase":"appinternalstats_prod","destinationTopic":"app_appinternalstats_realtimedata_topic","correlatedRecord":false,"needCorrelationCacheCleanup":false,"needCorrelation":false,"correlationAttributes":null,"correlationRecordCount":0,"correlateTimeWindowInMills":0,"lastCorrelationRecord":false,"realtimeESStorage":true,"receiverDateTimeStamp":1460127623591,"payloadData":{"timestamp":"2016-04-08T10:53:18.169","status":"get","source":"STREAM","fund":"JVV","client":"","region":"","evetid":"","osareqid":"","basis":"","pricingdate":"","content":"","msgname":"","recipient":"","objid":"","idlreqno":"","host":"WEB02","servermember":""},"payloadDataText":"","key":"app:appinternalstats_prod","destinationTopicName":"app_appinternalstats_realtimedata_topic","hdfsPath":"app/appinternalstats_prod","esindex":"app","estype":"appinternalstats_prod","useCase":"appinternalstats_prod","appCode":"app"}
{"senderDateTimeStamp":"2016/04/08 10:55:18","senderHost":null,"senderAppcode":"app","senderUsecase":"appinternalstats_prod","destinationTopic":"app_appinternalstats_realtimedata_topic","correlatedRecord":false,"needCorrelationCacheCleanup":false,"needCorrelation":false,"correlationAttributes":null,"correlationRecordCount":0,"correlateTimeWindowInMills":0,"lastCorrelationRecord":false,"realtimeESStorage":true,"receiverDateTimeStamp":1460127623591,"payloadData":{"timestamp":"2016-04-08T10:53:18.169","status":"get","source":"STREAM","fund":"JVV","client":"","region":"","evetid":"","osareqid":"","basis":"","pricingdate":"","content":"","msgname":"","recipient":"","objid":"","idlreqno":"","host":"WEB02","servermember":""},"payloadDataText":"","key":"app:appinternalstats_prod","destinationTopicName":"app_appinternalstats_realtimedata_topic","hdfsPath":"app/appinternalstats_prod","esindex":"app","estype":"appinternalstats_prod","useCase":"appinternalstats_prod","appCode":"app"}
With your example data in "c:/tmp.txt":
> df <- jsonlite::fromJSON(paste0("[",paste0(readLines("c:/tmp.txt"),collapse=","),"]"))$payloadData[c("timestamp","source","host","status")]
> df
                timestamp source  host status
1 2016-04-08T10:53:18.169 STREAM WEB01    get
2 2016-04-08T10:53:18.169 STREAM WEB02    get
3 2016-04-08T10:53:18.169 STREAM WEB02    get
So to adapt your code to get a list of dataframes:
dflist <- lapply(filenames, function(i) {
  jsonlite::fromJSON(
    paste0("[",
           paste0(readLines(i), collapse = ","),
           "]")
  )$payloadData[c("timestamp", "source", "host", "status")]
})
The idea is to transform your lines (from readLines) into a big JSON array and then create the dataframe by parsing it as JSON.
As lmo already showcased, using lapply on your filenames list provides you with a list of dataframes; if you really want only one dataframe at the end, you can load the data.table package and then use rbindlist on dflist to get a single dataframe.
Or, if you're short on memory, this thread may help you.
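A minimal sketch of that final combining step, assuming the data.table package is installed:
library(data.table)

# stack the per-file data frames in dflist into one table
df <- rbindlist(dflist)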
One speed-up is to replace your for loop with lapply, then drop the final rbind. The speed-up here is that R would not have to repeatedly copy an increasingly large data frame, df, over your "bunch" of files. The result would be stored in a convenient list that you could either use as is or convert to a data.frame in one go:
# create processing function
getData <- function(i) {
  print(i)
  data <- lapply(readLines(i), fromJSON)
  myDf <- do.call("rbind", lapply(data, function(d) {
    data.frame(TimeStamp = d$payloadData$timestamp,
               Source = d$payloadData$source,
               Host = d$payloadData$host,
               Status = d$payloadData$status)
  }))
}
# lapply over files
myDataList <- lapply(filenames, getData)
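The "convert to a data.frame in one go" step mentioned above might look like this (a sketch in base R; finalDf is just an illustrative name):
# stack all per-file data frames from the list into a single data frame
finalDf <- do.call("rbind", myDataList)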