Iteratively read a fixed number of lines into R - json

I have a json file I'm working with that contains multiple JSON objects in a single file. R is unable to read the file as a whole, but since each object occurs at regular intervals, I would like to iteratively read a fixed number of lines into R.
There are a number of SO questions on reading single lines into R, but I have been unable to extend those solutions to a fixed number of lines. For my problem I need to read 16 lines into R at a time (e.g. lines 1-16, 17-32, etc.).
I have tried using a loop but can't seem to get the syntax right:
## File
file <- "results.json"
## Create connection
con <- file(description=file, open="r")
## Loop over a file connection
for(i in 1:1000) {
  tmp <- scan(file=con, nlines=16, quiet=TRUE)
  data[i] <- fromJSON(tmp)
}
The file contains over 1000 objects of this form:
{
  "object": [
    [
      "a",
      0
    ],
    [
      "b",
      2
    ],
    [
      "c",
      2
    ]
  ]
}

With inspiration from @tomtom I was able to find a solution.
## File
file <- "results.json"
## Loop over a file
for(i in 1:1000) {
  tmp <- paste(scan(file=file, what="character", sep="\n", nlines=16, skip=(i-1)*16, quiet=TRUE),
               collapse=" ")
  assign(x = paste("data", i, sep = "_"), value = fromJSON(tmp))
}
I couldn't use a connection because each time I tried, the connection would close before the file had been completely read, so I got rid of that step.
I had to include the what="character" argument because scan() expects numeric input by default.
I included sep="\n", paste() and collapse=" " to create a single string rather than the character vector that scan() returns by default.
Finally I used assign() rather than a plain assignment to have a bit more control over the names of the output objects.
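An alternative sketch that avoids calling scan() repeatedly: read the whole file once and split it into 16-line chunks. This assumes, as above, that every object spans exactly 16 lines and that fromJSON() comes from jsonlite (rjson's fromJSON() works the same way on a single string):
library(jsonlite)
all_lines <- readLines("results.json")
## group consecutive lines into chunks of 16, one chunk per JSON object
chunks <- split(all_lines, ceiling(seq_along(all_lines) / 16))
data_list <- lapply(chunks, function(chunk) fromJSON(paste(chunk, collapse = " ")))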

This might help:
EDITED to make it use a list and Reduce into one file
## Loop over a file connection
data <- NULL
for(i in 1:1000) {
  tmp <- scan(file=con, nlines=16, skip=(i-1)*16, quiet=TRUE)
  data[[i]] <- fromJSON(tmp)
}
df <- Reduce(function(x, y) paste(x, y, collapse = " "), data)
You would have to make sure that you don't read past the end of the file though ;-)
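One way to guard against reading past the end (a small sketch, not part of the original answer) is to work out the number of complete 16-line objects up front instead of hard-coding 1000:
## number of complete 16-line objects actually in the file
n_obj <- length(readLines(file)) %/% 16
data <- vector("list", n_obj)
for (i in seq_len(n_obj)) {
  tmp <- scan(file = file, what = "character", sep = "\n",
              nlines = 16, skip = (i - 1) * 16, quiet = TRUE)
  data[[i]] <- fromJSON(paste(tmp, collapse = " "))
}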

Related

Running timeseries graphing function in Rmd producing cluttered x-axis labels (not present in test code)

I have a folder of xx .csv timeseries that I want to graph and knit into a clean HTML document. I have ggplot code that produces the plot I want using a single timeseries.csv. However, when I try to put the bones of that ggplot code in a function inside a for loop to run each of the timeseries.csv files through the function, I get some plots with pretty different formatting.
Plot generated with my test ggplot code:
Plot generated with function and for loop:
Changes I'm trying to make to the ugly Rmd plot:
Nicely space the x-axis tick marks to whole mins (i.e. "11:14:00", "11:15:00")
Connect the data points (solved by substituting geom_path() for geom_line())
Example Rmd code below. Please note that the graphs this example produces still have nice formatting; I'm not sure how to reproduce the problem short of posting a 500-row dataframe. I also don't know how to post my Rmd code without SO applying the formatting commands in this post, so I threw in 3 quote characters (") around my header formatting and at the end of the code to disable it.
Edits and Updates
I am getting a persistent message: geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?
As suggested by the commenters, I tried removing plot() and calling createChlDiffPlot() directly, and also replacing plot() with print(). Both produce the same ugly plots as before.
Replaced geom_line() with geom_path(). The points are now connected! x-axis cluttering is still there.
The Time variable is reading in as hms num.
Many thanks for any help on this!
```
---
title: "Chl Filtration"
output:
  flexdashboard::flex_dashboard:
    theme: yeti
    orientation: rows
editor_options:
  chunk_output_type: console
---
```{r setup}
library(flexdashboard)
library(dplyr)
library(ggplot2)
library(hms)
library(ggthemes)
library(readr)
library(data.table)
#### Example Data
df1 <- data.frame(Time = as_hms(c("11:22:33","11:22:34","11:22:35","11:22:38","11:23:00","11:23:01","11:23:02")),
                  Chl_ug_L_Up = c(0.2,0.1,0.25,-0.2,-0.3,-0.15,0.1),
                  Chl_ug_L_Down = c(0.5,0.4,0.3,0.2,0.1,0,-0.1))
df2 <- data.frame(Time = as_hms(c("08:02:33","08:02:34","08:02:35","08:02:40","08:02:42","08:02:43","08:02:49")),
                  Chl_ug_L_Up = c(-0.2,-0.1,-0.25,0.2,0.3,0.15,-0.1),
                  Chl_ug_L_Down = c(-0.1,0,0.1,0.2,0.3,0.4,0.1))
data_directory = "./" # data folder in R project folder in the real deal
output_directory = "./" # output graph directory in R project folder
write_csv(df1, file.path(data_directory, "SO_example_df1.csv"))
write_csv(df2, file.path(data_directory, "SO_example_df2.csv"))
#### Function to create graphs
createChlDiffPlot = function(aTimeSeriesFile, aFileName, aGraphOutputDirectory, aType)
{
  aFile_Mod = aTimeSeriesFile %>%
    select(Time, Chl_ug_L_Up, Chl_ug_L_Down) %>%
    mutate(Chl_diff = Chl_ug_L_Up - Chl_ug_L_Down)
  one_plot = ggplot(data = aFile_Mod, aes(x = Time, y = Chl_diff)) + # tried adding 'group = 1' in aes to connect points
    geom_path(size = 1, color = "green") +
    geom_point(color = "green") +
    theme_gdocs() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          legend.title = element_blank()) +
    labs(x = "", y = "Chl Difference", title = paste0(aFileName, " - ", "Filtration"))
  one_graph_name = paste0(gsub(".csv", "", aFileName), "_", aType, ".pdf")
  ggsave(one_graph_name, one_plot, dpi = 600, width = 7, height = 5, units = "in", device = "pdf", path = aGraphOutputDirectory)
  return(one_plot)
}
"``` ### remove the quotes when running example
Plots - After Velocity Adjustment
=====================================" ### remove quotes when running example
```{r, fig.width=13.5, fig.height=5}
all_files_Filtration = list.files(data_directory, pattern = ".csv")
# Loop to plot function
for(file in 1:length(all_files_Filtration))
{
  file_name = all_files_Filtration[file]
  one_file = fread(file.path(data_directory, file_name))
  # plot the time series again
  plot(createChlDiffPlot(one_file, file_name, output_directory, "Velocity_Paired"))
}
"``` #remove quotes when running example
```
I finally figured it out.
1) Replacing geom_line() with geom_path() connected the data points when rendered in Rmd.
2) df1$Time was formatted as a difftime object. When I looked at the dataframe in the global environment it showed Time : hms num 11:11:09 ..., which made me think my format was OK, but when I ran class(df1$Time) I got [1] "hms" "difftime". With a quick Google search I found out that difftime objects are not quite the same as hms, and my original Time column had been generated by subtracting times. I added a conversion to my mutate() call:
select(Time, Chl_ug_L_Up, Chl_ug_L_Down) %>%
  mutate(Chl_diff = Chl_ug_L_Up - Chl_ug_L_Down,
         Time = as_hms(Time)) # convert difftime object to hms
I think ggplot has some auto-formatting for hms variables, which is why the difftime variable was producing ugly, crowded x-axes.
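If you also want to force the ticks to whole minutes explicitly rather than relying on ggplot's hms defaults, one possible approach (not part of the original answer; a sketch built on the df1 example data from the question, assuming scales::breaks_width() works with scale_x_time()) is:
library(ggplot2)
library(scales)
library(dplyr)
library(hms)
plot_data <- df1 %>%   # df1 from the example data above
  mutate(Chl_diff = Chl_ug_L_Up - Chl_ug_L_Down,
         Time = as_hms(Time))   # make sure Time is hms, not a bare difftime
ggplot(plot_data, aes(x = Time, y = Chl_diff)) +
  geom_path(color = "green") +
  geom_point(color = "green") +
  # one tick every 60 seconds, i.e. at whole minutes
  scale_x_time(breaks = breaks_width(60)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))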

Error while trying to parse json into R

I have recently started using R and have a task that requires parsing JSON in R to get a non-JSON format. For this, I am using the fromJSON() function. I have tried to parse the JSON as a text file. It runs successfully when I do it with just a single row entry, but when I try it with multiple row entries, I get the following error:
fromJSON("D:/Eclairs/Printing/test3.txt")
Error in feed_push_parser(readBin(con, raw(), n), reset = TRUE) :
lexical error: invalid char in json text.
[{'CategoryType':'dining','City':
(right here) ------^
> fromJSON("D:/Eclairs/Printing/test3.txt")
Error in feed_push_parser(readBin(con, raw(), n), reset = TRUE) :
parse error: trailing garbage
"mumbai","Location":"all"}] [{"JourneyType":"Return","Origi
(right here) ------^
> fromJSON("D:/Eclairs/Printing/test3.txt")
Error in feed_push_parser(readBin(con, raw(), n), reset = TRUE) :
parse error: after array element, I expect ',' or ']'
:"mumbai","Location":"all"} {"JourneyType":"Return","Origin
(right here) ------^
The above errors come from three different formats in which I tried to parse the JSON text, but the result was the same; only the location pointed to by the caret changed.
Please help me identify the cause of this error, or suggest a more efficient way of performing the task.
The original file I have is an Excel sheet with multiple columns, one of which consists of JSON text. What I have tried so far is extracting just the JSON column, converting it to tab-separated text, and then parsing it as:
fromJSON("D:/Eclairs/Printing/test3.txt")
Please also suggest whether this can be done more efficiently. I need to map all the columns in the Excel sheet to the non-JSON text as well.
Example:
[{"CategoryType":"dining","City":"mumbai","Location":"all"}]
[{"CategoryType":"reserve-a-table","City":"pune","Location":"Kothrud,West Pune"}]
[{"Destination":"Mumbai","CheckInDate":"14-Oct-2016","CheckOutDate":"15-Oct-2016","Rooms":"1","NoOfPax":"3","NoOfAdult":"3","NoOfChildren":"0"}]
Consider reading in the text line by line with readLines(), iteratively saving the JSON dataframes to a growing list:
library(jsonlite)
con <- file("C:/Path/To/Jsons.txt", open="r")
jsonlist <- list()
while (length(line <- readLines(con, n=1, warn = FALSE)) > 0) {
  jsonlist <- append(jsonlist, list(fromJSON(line)))
}
close(con)
jsonlist
# [[1]]
#   CategoryType   City Location
# 1       dining mumbai      all
#
# [[2]]
#      CategoryType City          Location
# 1 reserve-a-table pune Kothrud,West Pune
#
# [[3]]
#   Destination CheckInDate CheckOutDate Rooms NoOfPax NoOfAdult NoOfChildren
# 1      Mumbai 14-Oct-2016  15-Oct-2016     1       3         3            0
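If you then want one flat table rather than a list, dplyr::bind_rows() can stack the list elements, filling columns that are missing from a given line with NA (a short sketch assuming the jsonlist built above):
library(dplyr)
# rows coming from lines that lack a column (e.g. Destination) get NA there
jsondf <- bind_rows(jsonlist)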

Getting data from JSON file in R

Let's say that I have the following JSON file:
{
  "id": "000018ac-04ef-4270-81e6-9e3cb8274d31",
  "currentCompany": "",
  "currentTitle": "--",
  "currentPosition": ""
}
I use the following code:
Usersfile <- ('trial.json') # where trial.json is the JSON above
library('rjson')
c <- file(Usersfile,'r')
l <- readLines(c,-71L)
json <- lapply(X=l,fromJSON)
and I have the following error:
Error: parse error: premature EOF
{
(right here) ------^
But when I open the JSON file (with Notepad) and put the data on one line:
{"id": "000018ac-04ef-4270-81e6-9e3cb8274d31","currentCompany": "","currentTitle": "--","currentPosition": ""}
The code works fine. (In reality the file is far too big to do this manually for each line.) Why is this happening? How can I overcome it?
Also, this one doesn't work:
{ "id": "000018ac-04ef-4270-81e6-9e3cb8274d31","currentCompany": "","currentTitle": "--","currentPosition": ""
}
EDIT: I used the following code, but I could read only the first value:
library('rjson')
c <- file.path(Usersfile)
data <- fromJSON(file=c)
Surprised this was never answered! Using the jsonlite package, you can collapse your JSON data into one character element with paste(x, collapse=""), removing the line breaks so it imports properly into an R data frame. I, too, faced a pretty-printed JSON with the exact same error:
library(jsonlite)
json <- do.call(rbind,
                lapply(paste(readLines(Usersfile, warn=FALSE), collapse=""),
                       jsonlite::fromJSON))
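For completeness, if the file holds a single JSON document (even pretty-printed across many lines, like the example at the top of this question), jsonlite can read it straight from the path with no collapsing at all; a minimal sketch:
library(jsonlite)
# fromJSON() accepts a file path and copes with pretty-printed, multi-line JSON
json <- fromJSON("trial.json")
str(json)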

get information from dataset (json format) in r

I created a datatable from a MongoDB collection. The data in this datatable is in JSON format, but I can't manage to extract the information from it.
{"place":{"bounding_box":{
"type":"Polygon",
"coordinates":[
[
[
-119.932568,
36.648905
],
[
-119.632419,
36.648905
]
]
]
}}}
I need the first two values of the coordinates: lat = 36.648905 and lon = -119.932568
But I can't seem to extract that info:
my_lon <- myBigDF$place.bounding_box.coordinates[1[1[1]]]
I have tried a few combinations but I'm always getting NULL.
Thank you for any help.
--EDIT-- Including the code showing how I'm connecting to the db and creating the dataframe from it:
library(rmongodb)   # provides mongo.create, mongo.find, etc.
library(plyr)
mongo <- mongo.create(host="localhost", db="mydb")
## create the empty data frame
myDF = data.frame(stringsAsFactors = FALSE)
## create the cursor we will iterate over, basically a select * in SQL
cursor = mongo.find(mongo, namespace)   # namespace defined elsewhere
## create the counter
i = 1
## iterate over the cursor
while (mongo.cursor.next(cursor)) {
  # iterate and grab the next record
  tmp = mongo.bson.to.list(mongo.cursor.value(cursor))
  # make it a dataframe
  tmp.df = as.data.frame(t(unlist(tmp)), stringsAsFactors = F)
  # bind to the master dataframe
  myDF = rbind.fill(myDF, tmp.df)
}
It's hard to tell exactly how you are going from the JSON string to an R object; different libraries parse things differently. If I assume for the moment that you use "rjson", then you would have something like
x <- rjson::fromJSON('{"place":{"bounding_box":{ "type":"Polygon", "coordinates":[ [ [ -119.932568, 36.648905 ], [ -119.632419, 36.648905 ] ] ] }}}')
And because your data seems to have an excessive number of square brackets, things are a bit messy. You can get to the coordinates section with
x$place$bounding_box$coordinates
# [[1]]
# [[1]][[1]]
# [1] -119.9326   36.6489
#
# [[1]][[2]]
# [1] -119.6324   36.6489
which is a list of lists of vectors. To make a nice matrix of lat/long coordinates you can do
do.call(rbind, x$place$bounding_box$coordinates[[1]])
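To pull out the specific values asked for, you can then index that matrix; a small sketch building on the rjson result above (column 1 is longitude, column 2 is latitude, matching the order in the original data):
coords <- do.call(rbind, x$place$bounding_box$coordinates[[1]])
# first point of the bounding box
my_lon <- coords[1, 1]   # -119.932568
my_lat <- coords[1, 2]   # 36.648905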

readHTMLTable in R throws warning within for loop

Hi, I have 5 HTML sources on each of which I want to run readHTMLTable and store the result. I can do this individually using:
readHTMLTable(iso.content[1],which=6)
readHTMLTable(iso.content[2],which=6)
.
.
however when putting this into a for loop I get:
library(XML)
> iso.table<-NULL
> for (i in 1:nrow(gene.iso)) {
+ iso.table[i]<-readHTMLTable(iso.content[i],which=6)
+ }
Warning messages:
1: In iso.table[i] <- readHTMLTable(iso.content[i], which = 6) :
number of items to replace is not a multiple of replacement length
2: In iso.table[i] <- readHTMLTable(iso.content[i], which = 6) :
number of items to replace is not a multiple of replacement length
3: In iso.table[i] <- readHTMLTable(iso.content[i], which = 6) :
number of items to replace is not a multiple of replacement length
4: In iso.table[i] <- readHTMLTable(iso.content[i], which = 6) :
number of items to replace is not a multiple of replacement length
5: In iso.table[i] <- readHTMLTable(iso.content[i], which = 6) :
number of items to replace is not a multiple of replacement length
So I can do this individually, but not using a for loop. It is not my aim to replace the current data on each iteration, so I am unsure why the warning talks about replacement.
any ideas?
The warning has nothing to do with readHTMLTable really; it's all about iso.table. I'm not sure what type of object you wanted that to be, but if you want to store a bunch of data.frames, you're going to need a list. And when you're assigning objects into a list, you want to place them with [[ ]], not [ ]. Try
iso.table <- list()
for (i in 1:nrow(gene.iso)) {
  iso.table[[i]] <- readHTMLTable(iso.content[i], which=6)
}
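To see why [ ] triggers that exact warning while [[ ]] does not, here is a small self-contained illustration (a toy data.frame rather than the question's tables):
x <- NULL
# single-bracket assignment tries to squeeze the data.frame's two columns into one slot,
# so R truncates it and warns "number of items to replace is not a multiple of replacement length"
x[1] <- data.frame(a = 1:3, b = 4:6)
y <- list()
# double-bracket assignment stores the whole data.frame as a single list element
y[[1]] <- data.frame(a = 1:3, b = 4:6)
str(y[[1]])   # an intact 3-row, 2-column data.frame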