Problems reading JSON file in R - json

I have a JSON file (an export from mongoDB) that I'd like to load into R. The document is about 890 MB in size with roughly 63,000 rows of 12 fields. The fields are numeric, character and date. I'd like to end up with a 63000 x 12 data frame.
lines <- readLines("fb2013.json")
result: jFile has all 63,000 elements in char class and all fields are lumped into one field.
Each file looks something like this:
"{ \"_id\" : \"10151271769737669\", \"comments_count\" : 36, \"created_at\" : { \"$date\" : 1357941938000 }, \"icon\" : \"http://blahblah.gif\", \"likes_count\" : 450, \"link\" : \"http://www.blahblahblah.php\", \"message\" : \"I wish I could figure this out!\", \"page_category\" : \"Computers\", \"page_id\" : \"30968999999\", \"page_name\" : \"NothingButTrouble\", \"type\" : \"photo\", \"updated_at\" : { \"$date\" : 1358210153000 } }"
Using rjson,
jFile <- fromJSON(paste(readLines("fb2013.json"), collapse=""))
only the first row is read into jFile but there are 12 fields.
Using RJSONIO:
jFile <- fromJSON(lines)
results in the following:
Warning messages:
1: In if (is.na(encoding)) return(0L) :
the condition has length > 1 and only the first element will be used
Again, only the first row is read into jFile and there are 12 fields.
The output from rjson and RJSONIO looks something like this:
$`_id`
[1] "1018535"
$comments_count
[1] 0
$created_at
$date
1.357027e+12
$icon
[1] "http://blah.gif"
$likes_count
[1] 20
$link
[1] "http://www.chachacha"
$message
[1] "I'd love to figure this out."
$page_category
[1] "Internet/software"
$page_id
[1] "3924395872345878534"
$page_name
[1] "Not Entirely Hopeless"
$type
[1] "photo"
$updated_at
$date
1.357027e+12

try
library(rjson)
path <- "WHERE/YOUR/JSON/IS/SAVED"
c <- file(path, "r")
l <- readLines(c, -1L)
json <- lapply(X=l, fromJSON)

Since you want a data.frame, try this:
# three copies of your sample...
line.1<- "{ \"_id\" : \"10151271769737669\", \"comments_count\" : 36, \"created_at\" : { \"$date\" : 1357941938000 }, \"icon\" : \"http://blahblah.gif\", \"likes_count\" : 450, \"link\" : \"http://www.blahblahblah.php\", \"message\" : \"I wish I could figure this out!\", \"page_category\" : \"Computers\", \"page_id\" : \"30968999999\", \"page_name\" : \"NothingButTrouble\", \"type\" : \"photo\", \"updated_at\" : { \"$date\" : 1358210153000 } }"
line.2<- "{ \"_id\" : \"10151271769737669\", \"comments_count\" : 36, \"created_at\" : { \"$date\" : 1357941938000 }, \"icon\" : \"http://blahblah.gif\", \"likes_count\" : 450, \"link\" : \"http://www.blahblahblah.php\", \"message\" : \"I wish I could figure this out!\", \"page_category\" : \"Computers\", \"page_id\" : \"30968999999\", \"page_name\" : \"NothingButTrouble\", \"type\" : \"photo\", \"updated_at\" : { \"$date\" : 1358210153000 } }"
line.3<- "{ \"_id\" : \"10151271769737669\", \"comments_count\" : 36, \"created_at\" : { \"$date\" : 1357941938000 }, \"icon\" : \"http://blahblah.gif\", \"likes_count\" : 450, \"link\" : \"http://www.blahblahblah.php\", \"message\" : \"I wish I could figure this out!\", \"page_category\" : \"Computers\", \"page_id\" : \"30968999999\", \"page_name\" : \"NothingButTrouble\", \"type\" : \"photo\", \"updated_at\" : { \"$date\" : 1358210153000 } }"
x <- paste(line.1, line.2, line.3, sep="\n")
lines <- readLines(textConnection(x))
library(rjson)
# this is the important bit
df <- data.frame(do.call(rbind,lapply(lines,fromJSON)))
ncol(df)
# [1] 12
# finally, there's some cleaning up to do...
df$created_at
# [[1]]
# [[1]]$`$date`
# [1] 1.357942e+12
# ...
df$created_at <- as.POSIXlt(unname(unlist(df$created_at)/1000),origin="1970-01-01")
df$created_at
# [1] "2013-01-11 17:05:38 EST" "2013-01-11 17:05:38 EST" "2013-01-11 17:05:38 EST"
df$updated_at <- as.POSIXlt(unname(unlist(df$updated_at)/1000),origin="1970-01-01")
Note that this conversion assumes that the dates were stored as milliseconds since the epoch.

Related

Expected object or value: Importing JSON into Python (Pandas)

I am using Python 3 and I tried with
data = pd.read_json('file.json',encoding="utf-8",orient='records',lines=True)
But It gives me:
ValueError: Expected object or value
This is the structure of the Json file, just a quick sample
{
"_id" : ObjectId("5af1b1fd4f4733eacf11dba9"),
"centralPath" : "XXX2",
"viewStats" : [
{
"totalViews" : NumberInt(3642),
"totalSheets" : NumberInt(393),
"totalSchedules" : NumberInt(427),
"viewsOnSheet" : NumberInt(1949),
"viewsOnSheetWithTemplate" : NumberInt(625),
"schedulesOnSheet" : NumberInt(371),
"unclippedViews" : NumberInt(876),
"createdOn" : ISODate("2017-10-13T18:06:45.291+0000"),
"_id" : ObjectId("59e100b535eeefcc27ee0802")
},
{
"totalViews" : NumberInt(3642),
"totalSheets" : NumberInt(393),
"totalSchedules" : NumberInt(427),
"viewsOnSheet" : NumberInt(1949),
"viewsOnSheetWithTemplate" : NumberInt(625),
"schedulesOnSheet" : NumberInt(371),
"unclippedViews" : NumberInt(876),
"createdOn" : ISODate("2017-10-13T19:11:47.530+0000"),
"_id" : ObjectId("59e10ff3eb0de5740c248df2")
}
]
}
With this method, I am able to see the data but I would like to have
with open('file.json', 'r') as viewsmc:
data = viewsmc.readlines()
With this the output
['{ \n',
' "_id" : ObjectId("5af1b1fd4f4733eacf11dba9"), \n',
' "centralPath" : "XXX2", \n',
' "viewStats" : [\n',
' {\n',
' "totalViews" : NumberInt(3642), \n',
' "totalSheets" : NumberInt(393), \n',
' "totalSchedules" : NumberInt(427), \n',
' "viewsOnSheet" : NumberInt(1949), \n',
' "viewsOnSheetWithTemplate" : NumberInt(625), \n',
' "schedulesOnSheet" : NumberInt(371), \n',
' "unclippedViews" : NumberInt(876), \n',
' "createdOn" : ISODate("2017-10-13T18:06:45.291+0000"), \n',
' "_id" : ObjectId("59e100b535eeefcc27ee0802")\n',
' }, \n',
I tried all different method and solution reported on the read_json / https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html
and load/ loads(str) etc. but nothing.
The Issue was with the format of the JSON file,
we tested with https://jsonformatter.curiousconcept.com/ and modify with a regular expression If you have better suggestions let me know.
import re
with open("views3.json", "r+") as read_file:
data = read_file.read()
x = re.sub("\w+\((.+)\)", r'\1', data)
print(x)
read_file.closed
do you want that?: use modul json for reading file.json
import pandas as pd
import json
with open('file.json') as viewsmc:
data = json.load(viewsmc)
print data #you have a dict
df = pd.DataFrame(data)
print(df)

Extract specific info from json-list

In my system from json-call http://192.168.1.6:8080/json.htm?type=devices&rid=89 I get the output below.
{
"ActTime" : 1501360852,
"ServerTime" : "2017-07-29 22:40:52",
"Sunrise" : "05:50",
"Sunset" : "21:28",
"result" : [
{
"AddjMulti" : 1.0,
"AddjMulti2" : 1.0,
"AddjValue" : 0.0,
"AddjValue2" : 0.0,
"BatteryLevel" : 255,
"CustomImage" : 0,
"Data" : "73 Lux",
"Description" : "",
"Favorite" : 1,
"HardwareID" : 4,
"HardwareName" : "Dummies",
"HardwareType" : "Dummy (Does nothing, use for virtual switches only)",
"HardwareTypeVal" : 15,
"HaveTimeout" : true,
"ID" : "82089",
"LastUpdate" : "2017-07-29 21:16:22",
"Name" : "ESP8266C_Licht1",
"Notifications" : "false",
"PlanID" : "0",
"PlanIDs" : [ 0 ],
"Protected" : false,
"ShowNotifications" : true,
"SignalLevel" : "-",
"SubType" : "Lux",
"Timers" : "false",
"Type" : "Lux",
"TypeImg" : "lux",
"Unit" : 1,
"Used" : 1,
"XOffset" : "0",
"YOffset" : "0",
"idx" : "89"
}
],
"status" : "OK",
"title" : "Devices"
}
'Automating' such call by the below scripttime lua-script is aimed at extraction of specific information, to be further used in applications.
The first 11 lines run without problems, but further extraction of the information is a problem.
I have tried various scriptlines to get a solution for A), B), C), D) and E), but either they generate error-report, or they don't give results: see further below for the 'best' trial-scriptlines and related results.
To avoid misunderstanding: those dashed commentlines in the script just below this question with A), B), C), D) and E) are only describing the desired actions/functions and are in no way meant as scriptlines!
Question:
Help requested in the form of better applicable scriptlines for A) till E) in the trialscript at the end of this message, or hints where to find applicable example scriptlines.
-- Lua-script to determine staleness, time-out and value for data from Json-call
print('Start of Timeout-script')
commandArray = {}
TimeOutLimit = 10 -- Allowed timeout in seconds
json = (loadfile "/home/pi/domoticz/scripts/lua/JSON.lua")() -- For Linux
-- json = (loadfile "D:\\Domoticz\\scripts\\lua\\json.lua")() -- For Windows
-- Line 07
local content=assert(io.popen('curl "http://192.168.1.6:8080/json.htm?type=devices&rid=89"')) -- notice double quotes
local list = content:read('*all')
content:close()
local jsonList = json:decode(list)
-- Line 12 Next scriptlines describe desired actions
-- A) Extract ServerTime as numeric value (not as string)
-- B) Extract LastUpdate as numeric value (not as string)
-- Staleness = ServerTime - LastUpdate
-- C) Extract HaveTimeout as boolean (not as string)
-- If HaveTimeout and (Staleness > TimeOutlimit) then
-- Print('TimeOutLimit exceeded by ' .. (Staleness - TimeOutLimit) .. 'seconds')
-- End
-- D) Extract textstring from Type or Data
-- E) Extract numeric value from Data
print('End of Timeout-script')
return commandArray
For lines 11 etc, the following trial-scriptlines gave 'best' results (= no errors):
-- Line 11
-- local Servertime = json:decode(ServerTime)
-- print('Servertime : '..Servertime)
-- Line 14
-- CheckTimeOut =jsonValue.result[1].HaveTimeout -- value from "HaveTimeout", inside "result" bloc number 1 (even if it's the only one)
CurrentServerTime =jsonValue.Servertime -- value from "ServerTime"
CurrentLastUpdate = jsonValue.result[1].LastUpdate
CurrentData = jsonValue.result[1].Data
-- Line 19
print('TimeOut : '..CheckTimeOut)
print('Servertime : '..CurrentServerTime)
print('LastUpdate : '..CurrentLastUpdate)
print('Data-content : '..CurrentData)
print('End of Timeout-script')
return commandArray
Results:
Without dashes before the lines 12 and 13, respectively 15, then the following error reports:
660: nil passed to JSON:decode()
lua:15: attempt to index global 'jsonValue' (a nil value)
With dashes before lines 12, 13 and 15 for the trial-scriptlines shown above, according to the log no errors exist (as demonstrated by the 2 prints)
2017-07-31 16:30:02.520 LUA: Start of Timeout-script
2017-07-31 16:30:02.563 LUA: End of Timeout-script
But why no print-results in the log from Lines 20 till 23?
Not having those print-results makes it difficult to determine next steps in data-extraction, to achieve the objectives described under A) till E).
;-) Error reports generally are more useful information than "no errors, but no results"
Without downloading content over the network, the working script looks like this:
local json = require("json")
local jj = [[
{
"ActTime" : 1501360852,
"ServerTime" : "2017-07-29 22:40:52",
"Sunrise" : "05:50",
"Sunset" : "21:28",
"result" : [
{
"AddjMulti" : 1.0,
"AddjMulti2" : 1.0,
"AddjValue" : 0.0,
"AddjValue2" : 0.0,
"BatteryLevel" : 255,
"CustomImage" : 0,
"Data" : "73 Lux",
"Description" : "",
"Favorite" : 1,
"HardwareID" : 4,
"HardwareName" : "Dummies",
"HardwareType" : "Dummy (Does nothing, use for virtual switches only)",
"HardwareTypeVal" : 15,
"HaveTimeout" : true,
"ID" : "82089",
"LastUpdate" : "2017-07-29 21:16:22",
"Name" : "ESP8266C_Licht1",
"Notifications" : "false",
"PlanID" : "0",
"PlanIDs" : [ 0 ],
"Protected" : false,
"ShowNotifications" : true,
"SignalLevel" : "-",
"SubType" : "Lux",
"Timers" : "false",
"Type" : "Lux",
"TypeImg" : "lux",
"Unit" : 1,
"Used" : 1,
"XOffset" : "0",
"YOffset" : "0",
"idx" : "89"
}
],
"status" : "OK",
"title" : "Devices"
}
]]
print('Start of Timeout-script')
local jsonValue = json.decode(jj)
CheckTimeOut =jsonValue.result[1].HaveTimeout
CurrentServerTime =jsonValue.Servertime
CurrentLastUpdate = jsonValue.result[1].LastUpdate
CurrentData = jsonValue.result[1].Data
print('TimeOut : '.. (CheckTimeOut and "true" or "false") )
print('Servertime : '.. (CurrentServerTime or "nil") )
print('LastUpdate : '.. (CurrentLastUpdate or "nil") )
print('Data-content : '.. (CurrentData or "nil") )
print('End of Timeout-script')
result:
Start of Timeout-script
TimeOut : true
Servertime : nil
LastUpdate : 2017-07-29 21:16:22
Data-content : 73 Lux
End of Timeout-script

JSON - Character issue

I extracted some data from a mongo database using the RMongo library. I have been working with the data with no problem. However, I need to access a field that was saved, originally in the database, as JSON. Since rmongodb saves the data as data frame, I now have a large character vector of length 1:
res1 = "[ { \"text\" : \"#Kayture Beyoncé jam session ?\" , \"name\" : \"beponcé \xed\xa0\xbc\xed\xbc\xbb\" , \"screenName\" : \"ColaaaaTweedy\" , \"follower\" : false , \"mentions\" : [ \"Kayture\"] , \"userTwitterId\" : \"108061963\"} , { \"text\" : \"#Kayture fucking marry me\" , \"name\" : \"George McQueen\" , \"screenName\" : \"GeorgeMcQueen12\" , \"follower\" : false , \"mentions\" : [ \"Kayture\"] , \"userTwitterId\" : \"67896750\"}]"
I need to extract all the "text" attributes of the objects from this array (there are 2 in this example), but I can not figure out a fast way. I was trying using strsplit, or going from character to json files using jsonlite, and then to list, but it does not work.
Any ideas?
Thanks!
Starting from
res1 = "[ { \"text\" : \"#Kayture Beyoncé jam session ?\" , \"name\" : \"beponcé \xed\xa0\xbc\xed\xbc\xbb\" , \"screenName\" : \"ColaaaaTweedy\" , \"follower\" : false , \"mentions\" : [ \"Kayture\"] , \"userTwitterId\" : \"108061963\"} , { \"text\" : \"#Kayture fucking marry me\" , \"name\" : \"George McQueen\" , \"screenName\" : \"GeorgeMcQueen12\" , \"follower\" : false , \"mentions\" : [ \"Kayture\"] , \"userTwitterId\" : \"67896750\"}]"
you can use fromJSON() from the jsonlite package to parse that JSON object.
library(jsonlite)
fromJSON(res1)
text name screenName follower mentions userTwitterId
1 #Kayture Beyoncé jam session ? beponcé í ¼í¼» ColaaaaTweedy FALSE Kayture 108061963
2 #Kayture fucking marry me George McQueen GeorgeMcQueen12 FALSE Kayture 67896750

Loading JSON data to a list in a particular order using PyMongo

Let's say I have the following document in a MongoDB database:
{
"assist_leaders" : {
"Steve Nash" : {
"team" : "Phoenix Suns",
"position" : "PG",
"draft_data" : {
"class" : 1996,
"pick" : 15,
"selected_by" : "Phoenix Suns",
"college" : "Santa Clara"
}
},
"LeBron James" : {
"team" : "Cleveland Cavaliers",
"position" : "SF",
"draft_data" : {
"class" : 2003,
"pick" : 1,
"selected_by" : "Cleveland Cavaliers",
"college" : "None"
}
},
}
}
I'm trying to collect a few values under "draft_data" for each player in an ORDERED list. The list needs to look like the following for this particular document:
[ [1996, 15, "Phoenix Suns"], [2003, 1, "Cleveland Cavaliers"] ]
That is, each nested list must contain the values corresponding to the "pick", "selected_by", and "class" keys, in that order. I also need the "Steve Nash" data to come before the "LeBron James" data.
How can I achieve this using pymongo? Note that the structure of the data is not set in stone so I can change this if that makes the code simpler.
I'd extract the data and turn it into a list in Python, once you've retrieved the document from MongoDB:
for doc in db.collection.find():
for name, info in doc['assist_leaders'].items():
draft_data = info['draft_data']
lst = [draft_data['class'], draft_data['pick'], draft_data['selected_by']]
print name, lst
List comprehension is the way to go here (Note: don't forget .iteritems() in Python2 or .items() in Python3 or you'll get a ValueError: too many values to unpack).
import pymongo
import numpy as np
client = pymongo.MongoClient()
db = client[database_name]
dataList = [v for i in ["Steve Nash", "LeBron James"]
for key in ["class", "pick", "selected_by"]
for document in db.collection_name.find({"assist_leaders": {"$exists": 1}})
for k, v in document["assist_leaders"][i]["draft_data"].iteritems()
if k == key]
print dataList
# [1996, 15, "Phoenix Suns", 2003, 1, "Cleveland Cavaliers"]
matrix = np.reshape(dataList, [2,3])
print matrix
# [ [1996, 15, "Phoenix Suns"],
# [2003, 1, "Cleveland Cavaliers"] ]

Ragged list or data frame to JSON

I am trying to create a ragged list in R that corresponds to the D3 tree structure of flare.json. My data is in a data.frame:
path <- data.frame(P1=c("direct","direct","organic","direct"),
P2=c("direct","direct","end","end"),
P3=c("direct","organic","",""),
P4=c("end","end","",""), size=c(5,12,23,45))
path
P1 P2 P3 P4 size
1 direct direct direct end 5
2 direct direct organic end 12
3 organic end 23
4 direct end 45
but it could also be a list or reshaped if necessary:
path <- list()
path[[1]] <- list(name=c("direct","direct","direct","end"),size=5)
path[[2]] <- list(name=c("direct","direct","organic","end"), size=12)
path[[3]] <- list(name=c("organic", "end"), size=23)
path[[4]] <- list(name=c("direct", "end"), size=45)
The desired output is:
rl <- list()
rl <- list(name="root", children=list())
rl$children[1] <- list(list(name="direct", children=list()))
rl$children[[1]]$children[1] <- list(list(name="direct", children=list()))
rl$children[[1]]$children[[1]]$children[1] <- list(list(name="direct", children=list()))
rl$children[[1]]$children[[1]]$children[[1]]$children[1] <- list(list(name="end", size=5))
rl$children[[1]]$children[[1]]$children[2] <- list(list(name="organic", children=list()))
rl$children[[1]]$children[[1]]$children[[2]]$children[1] <- list(list(name="end", size=12))
rl$children[[1]]$children[2] <- list(list(name="end", size=23))
rl$children[2] = list(list(name="organic", children=list()))
rl$children[[2]]$children[1] <- list(list(name="end", size=45))
So when I print to json it's:
require(RJSONIO)
cat(toJSON(rl, pretty=T))
{
"name" : "root",
"children" : [
{
"name" : "direct",
"children" : [
{
"name" : "direct",
"children" : [
{
"name" : "direct",
"children" : [
{
"name" : "end",
"size" : 5
}
]
},
{
"name" : "organic",
"children" : [
{
"name" : "end",
"size" : 12
}
]
}
]
},
{
"name" : "end",
"size" : 23
}
]
},
{
"name" : "organic",
"children" : [
{
"name" : "end",
"size" : 45
}
]
}
]
}
I am having a hard time wrapping my head around the recursive steps that are necessary to create this list structure in R. In JS I can pretty easily move around the nodes and at each node determine whether to add a new node or keep moving down the tree by using push as needed, eg: new = {"name": node, "children": []}; or new = {"name": node, "size": size}; as in this example. I tried to split the data.frame as in this example:
makeList<-function(x){
if(ncol(x)>2){
listSplit<-split(x,x[1],drop=T)
lapply(names(listSplit),function(y){list(name=y,children=makeList(listSplit[[y]]))})
} else {
lapply(seq(nrow(x[1])),function(y){list(name=x[,1][y],size=x[,2][y])})
}
}
jsonOut<-toJSON(list(name="root",children=makeList(path)))
but it gives me an error
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
Error during wrapup: evaluation nested too deeply: infinite recursion / options(expressions=)?
The function given in the linked Q&A is essentially what you need, however it was failing on your data set because of the null values for some rows in the later columns. Instead of just blindly repeating the recursion until you run out of columns, you need to check for your "end" value, and use that to switch to making leaves:
makeList<-function(x){
listSplit<-split(x[-1],x[1], drop=TRUE);
lapply(names(listSplit),function(y){
if (y == "end") {
l <- list();
rows = listSplit[[y]];
for(i in 1:nrow(rows) ) {
l <- c(l, list(name=y, size=rows[i,"size"] ) );
}
l;
}
else {
list(name=y,children=makeList(listSplit[[y]]))
}
});
}
I believe this does what you want, though it has some limitations. In particular, it is assumed that every branch in your network is unique (i.e. there can't be two rows in your data frame that are equal for every column other than size):
df.split <- function(p.df) {
p.lst.tmp <- unname(split(p.df, p.df[, 1]))
p.lst <- lapply(
p.lst.tmp,
function(x) {
if(ncol(x) == 2L && nrow(x) == 1L) {
return(list(name=x[1, 1], size=unname(x[, 2])))
} else if (isTRUE(is.na(unname(x[ ,2])))) {
return(list(name=x[1, 1], size=unname(x[, ncol(x)])))
}
list(name=x[1, 1], children=df.split(x[, -1, drop=F]))
}
)
p.lst
}
all.equal(rl, df.split(path)[[1]])
# [1] TRUE
Though note you had the organic size switched, so I had to fix your rl to get this result (rl has it as 45, but your path as 23). Also, I modified your path data.frame slightly:
path <- data.frame(
root=rep("root", 4),
P1=c("direct","direct","organic","direct"),
P2=c("direct","direct","end","end"),
P3=c("direct","organic",NA,NA),
P4=c("end","end",NA,NA),
size=c(5,12,23,45),
stringsAsFactors=F
)
WARNING: I haven't tested this with other structures, so it's possible it will hit corner cases that you'll need to debug.