I am using Python 3 and tried:
data = pd.read_json('file.json', encoding="utf-8", orient='records', lines=True)
But it gives me:
ValueError: Expected object or value
This is the structure of the JSON file (just a quick sample):
{
"_id" : ObjectId("5af1b1fd4f4733eacf11dba9"),
"centralPath" : "XXX2",
"viewStats" : [
{
"totalViews" : NumberInt(3642),
"totalSheets" : NumberInt(393),
"totalSchedules" : NumberInt(427),
"viewsOnSheet" : NumberInt(1949),
"viewsOnSheetWithTemplate" : NumberInt(625),
"schedulesOnSheet" : NumberInt(371),
"unclippedViews" : NumberInt(876),
"createdOn" : ISODate("2017-10-13T18:06:45.291+0000"),
"_id" : ObjectId("59e100b535eeefcc27ee0802")
},
{
"totalViews" : NumberInt(3642),
"totalSheets" : NumberInt(393),
"totalSchedules" : NumberInt(427),
"viewsOnSheet" : NumberInt(1949),
"viewsOnSheetWithTemplate" : NumberInt(625),
"schedulesOnSheet" : NumberInt(371),
"unclippedViews" : NumberInt(876),
"createdOn" : ISODate("2017-10-13T19:11:47.530+0000"),
"_id" : ObjectId("59e10ff3eb0de5740c248df2")
}
]
}
With the method below I am able to see the data, but only as a list of raw lines, not in the form I would like to have:
with open('file.json', 'r') as viewsmc:
    data = viewsmc.readlines()
This is the output:
['{ \n',
' "_id" : ObjectId("5af1b1fd4f4733eacf11dba9"), \n',
' "centralPath" : "XXX2", \n',
' "viewStats" : [\n',
' {\n',
' "totalViews" : NumberInt(3642), \n',
' "totalSheets" : NumberInt(393), \n',
' "totalSchedules" : NumberInt(427), \n',
' "viewsOnSheet" : NumberInt(1949), \n',
' "viewsOnSheetWithTemplate" : NumberInt(625), \n',
' "schedulesOnSheet" : NumberInt(371), \n',
' "unclippedViews" : NumberInt(876), \n',
' "createdOn" : ISODate("2017-10-13T18:06:45.291+0000"), \n',
' "_id" : ObjectId("59e100b535eeefcc27ee0802")\n',
' }, \n',
I tried all the different methods and solutions described in the read_json documentation (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html),
as well as load / loads(str) etc., but nothing worked.
The issue was with the format of the JSON file: it is not valid JSON.
We tested it with https://jsonformatter.curiousconcept.com/ and fixed it with a regular expression. If you have better suggestions, let me know.
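import re

with open("views3.json", "r") as read_file:
    data = read_file.read()

# Strip MongoDB shell wrappers such as ObjectId(...), NumberInt(...) and
# ISODate(...), keeping only the value inside the parentheses.
x = re.sub(r"\w+\((.+?)\)", r"\1", data)
print(x)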
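As a rough sketch of where the regular-expression approach can lead, the cleaned text can be parsed with the standard json module and flattened into one record per viewStats entry. The sample string is a shortened copy of the question's data, and the names raw, cleaned and records are only illustrative; the pattern assumes no wrapped value contains a closing parenthesis.

```python
import json
import re

# Shortened sample in MongoDB shell notation (taken from the question).
raw = '''{
    "_id" : ObjectId("5af1b1fd4f4733eacf11dba9"),
    "centralPath" : "XXX2",
    "viewStats" : [
        {
            "totalViews" : NumberInt(3642),
            "createdOn" : ISODate("2017-10-13T18:06:45.291+0000"),
            "_id" : ObjectId("59e100b535eeefcc27ee0802")
        }
    ]
}'''

# Replace Wrapper(value) with value; the non-greedy group stops at the first ")".
cleaned = re.sub(r'\w+\((.+?)\)', r'\1', raw)
doc = json.loads(cleaned)

# One flat record per viewStats entry, carrying the parent's centralPath along.
records = [dict(stat, centralPath=doc["centralPath"]) for stat in doc["viewStats"]]
print(records[0]["totalViews"])
```

A list of flat dicts like records can then be handed to pd.DataFrame(records) if a DataFrame is the goal.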
Is this what you want? Use the json module to read file.json (this requires the file to be valid JSON):
import pandas as pd
import json

with open('file.json') as viewsmc:
    data = json.load(viewsmc)

print(data)  # you have a dict
df = pd.DataFrame(data)
print(df)
I have this JSON (I'm not posting the whole thing because it's very long, but you don't need the rest):
"cve" : {
"data_type" : "CVE",
"data_format" : "MITRE",
"data_version" : "4.0",
"CVE_data_meta" : {
"ID" : "CVE-2018-9991",
"ASSIGNER" : "cve#mitre.org"
},
"affects" : {
"vendor" : {
"vendor_data" : [ {
"vendor_name" : "frog_cms_project",
"product" : {
"product_data" : [ {
"product_name" : "frog_cms",
"version" : {
"version_data" : [ {
"version_value" : "0.9.5"
} ]
}
} ]
}
} ]
}
},
What I want to do is to print the vendor name of this cve.
So, what I did is :
with open("nvdcve-1.0-2018.json", "r") as file:
    data = json.load(file)

increment = 0
number_cve = data["CVE_data_numberOfCVEs"]
while increment < int(number_cve):
    print(data['CVE_Items'][increment]['cve']['CVE_data_meta']['ID'])
    print(',')
    print(data['CVE_Items'][increment]['cve']['affects']['vendor']['vendor_data'][0]['vendor_name'])
    print("\n")
    increment += 1
The reason I used a while loop is that the JSON file contains a lot of CVEs, which is why I index with data['CVE_Items'][increment]['cve'] (this part works fine: the line print(data['CVE_Items'][increment]['cve']['CVE_data_meta']['ID']) works well).
My error is in the print(data['CVE_Items'][increment]['cve']['affects']['vendor']['vendor_data'][0]['vendor_name']) line: Python raises a "list index out of range" error.
But if I'm reading this JSON correctly, vendor_data is an array with one element, so vendor_name should be ['vendor_data'][0]['vendor_name'], shouldn't it?
The only way I found to get the vendor_name is:
for value in data['CVE_Items'][a]['cve']['affects']['vendor']['vendor_data']:
    print(value['vendor_name'])
instead of print(data['CVE_Items'][increment]['cve']['affects']['vendor']['vendor_data'][0]['vendor_name'])
Writing a for loop for a single iteration is pretty ugly, but at least value is the data['CVE_Items'][a]['cve']['affects']['vendor']['vendor_data'][0] that I wanted.
Does anyone know what is going on?
Make sure every CVE_Item has a non-empty vendor_data.
Example:
with open("nvdcve-1.0-2018.json", "r") as file:
    data = json.load(file)

increment = 0
number_cve = data["CVE_data_numberOfCVEs"]
while increment < int(number_cve):
    print(data['CVE_Items'][increment]['cve']['CVE_data_meta']['ID'])
    print(',')
    # only index [0] when vendor_data actually has an element
    if len(data['CVE_Items'][increment]['cve']['affects']['vendor']['vendor_data']) > 0:
        print(data['CVE_Items'][increment]['cve']['affects']['vendor']['vendor_data'][0]['vendor_name'])
    print("\n")
    increment += 1
Thanks to Ron Nabuurs' answer I found that vendor_data does not always contain a vendor_name. That is why the for loop works and the direct print does not
(the for loop simply does nothing when the list is empty, whereas indexing [0] fails).
So what I did is:
try:
    print(data['CVE_Items'][increment]['cve']['affects']['vendor']['vendor_data'][0]['vendor_name'])
    print(',')
except IndexError:
    # vendor_data is empty for this CVE, so there is no vendor to print
    pass
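A possible alternative to both the while loop and the try/except is to iterate over CVE_Items directly and test whether vendor_data is empty. The sketch below uses a tiny hand-made dict in place of the real feed; the second CVE ID and the helper name first_vendor are made up for illustration.

```python
# Hypothetical miniature of the NVD feed; the second item has an empty vendor_data list.
data = {
    "CVE_data_numberOfCVEs": "2",
    "CVE_Items": [
        {"cve": {"CVE_data_meta": {"ID": "CVE-2018-9991"},
                 "affects": {"vendor": {"vendor_data": [
                     {"vendor_name": "frog_cms_project"}]}}}},
        {"cve": {"CVE_data_meta": {"ID": "CVE-2018-0001"},  # made-up ID
                 "affects": {"vendor": {"vendor_data": []}}}},
    ],
}


def first_vendor(item):
    """Return the first vendor name of a CVE item, or None if vendor_data is empty."""
    vendors = item["cve"]["affects"]["vendor"]["vendor_data"]
    return vendors[0]["vendor_name"] if vendors else None


# Iterating over the list itself removes the need for a manual counter.
pairs = [(item["cve"]["CVE_data_meta"]["ID"], first_vendor(item))
         for item in data["CVE_Items"]]
for cve_id, vendor in pairs:
    print(cve_id, vendor)
```

Returning None for the empty case keeps the missing vendors visible in the output instead of silently skipping them.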
I have JSON data in a file json_format.py as follows:
{
"name" : "ramu",
"place" : "hyd",
"height" : 5.10,
"list" : [1,2,3,4,5,6],
"tuple" : (0,1,2),
"colors" : {"mng":"white","aft" : "blue","night":"red"},
"car" : "None",
"bike" : "True",
}
I'm reading the above with this code:
import json
from pprint import pprint
with open (r'C:/PythonPrograms\Json_example/json_format.py') as jobj:
    fp = jobj.readlines()
b = json.dumps(fp) # ---> I get string
print(type(b))
c = json.loads(b)
print(type(c)) # ---> List
pprint(c)
print(c[0])
pprint(c["name"])
Now, I would like to access the JSON object as c['name'] and the output should be ramu.
Since c is a list, I can't do so. How can I read my JSON data so that I can access it with keys?
Thanks in advance!
You're effectively doing c = json.loads(json.dumps(jobj.readlines())) when you just need:
c = json.load(jobj)
print(c["name"]) # ramu
Also, your JSON is malformed.
There are no tuples in JSON: "tuple" : (0,1,2),
Your last item should not end with a comma: "bike" : "True",
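To make the two fixes concrete, here is a small sketch that parses a corrected copy of the data from a string (the variable name text is only illustrative) and also shows that a trailing comma is rejected by the parser:

```python
import json

# Corrected copy of the question's data: the tuple becomes a JSON array
# and the trailing comma after the last item is removed.
text = '''{
    "name": "ramu",
    "place": "hyd",
    "height": 5.10,
    "list": [1, 2, 3, 4, 5, 6],
    "tuple": [0, 1, 2],
    "colors": {"mng": "white", "aft": "blue", "night": "red"},
    "car": "None",
    "bike": "True"
}'''

c = json.loads(text)
print(c["name"])           # access by key now works because c is a dict
print(c["colors"]["aft"])

# A trailing comma makes the document invalid JSON.
try:
    json.loads('{"bike": "True",}')
    trailing_comma_ok = True
except json.JSONDecodeError:
    trailing_comma_ok = False
print(trailing_comma_ok)
```

Note that "None" and "True" stay plain strings here; JSON's own literals would be null and true.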
I extracted some data from a mongo database using the RMongo library. I have been working with the data with no problem. However, I need to access a field that was saved, originally in the database, as JSON. Since rmongodb saves the data as data frame, I now have a large character vector of length 1:
res1 = "[ { \"text\" : \"#Kayture Beyoncé jam session ?\" , \"name\" : \"beponcé \xed\xa0\xbc\xed\xbc\xbb\" , \"screenName\" : \"ColaaaaTweedy\" , \"follower\" : false , \"mentions\" : [ \"Kayture\"] , \"userTwitterId\" : \"108061963\"} , { \"text\" : \"#Kayture fucking marry me\" , \"name\" : \"George McQueen\" , \"screenName\" : \"GeorgeMcQueen12\" , \"follower\" : false , \"mentions\" : [ \"Kayture\"] , \"userTwitterId\" : \"67896750\"}]"
I need to extract all the "text" attributes of the objects from this array (there are 2 in this example), but I can not figure out a fast way. I was trying using strsplit, or going from character to json files using jsonlite, and then to list, but it does not work.
Any ideas?
Thanks!
Starting from
res1 = "[ { \"text\" : \"#Kayture Beyoncé jam session ?\" , \"name\" : \"beponcé \xed\xa0\xbc\xed\xbc\xbb\" , \"screenName\" : \"ColaaaaTweedy\" , \"follower\" : false , \"mentions\" : [ \"Kayture\"] , \"userTwitterId\" : \"108061963\"} , { \"text\" : \"#Kayture fucking marry me\" , \"name\" : \"George McQueen\" , \"screenName\" : \"GeorgeMcQueen12\" , \"follower\" : false , \"mentions\" : [ \"Kayture\"] , \"userTwitterId\" : \"67896750\"}]"
you can use fromJSON() from the jsonlite package to parse that JSON object.
library(jsonlite)
fromJSON(res1)
                            text           name      screenName follower mentions userTwitterId
1 #Kayture Beyoncé jam session ? beponcé í ¼í¼»   ColaaaaTweedy    FALSE  Kayture     108061963
2      #Kayture fucking marry me George McQueen GeorgeMcQueen12    FALSE  Kayture      67896750
Here is my pymongo call
from pymongo import MongoClient
client = MongoClient('localhost', 27017)
db = client['somedb']
collection = db.some_details
pipe = [{'$group': {'_id': '$mvid', 'count': {'$sum': 1}}}]
TestOutput = db.collection.aggregate(pipeline=pipe)
print(list(TestOutput))
client.close()
For some reason resulting list is empty, while in Robomongo I get nonempty output.
Is formatting incorrect?
The exact Robomongo query is
db.some_details.aggregate([{$group: {_id: '$mvid', count: {$sum: 1}}}])
UPDATE
The output looks like
{
"result" : [
{
"_id" : "4f973d56a64facfaa7c3r4rf262ad5be695eef329aff7ab4610ddedfb8137427",
"count" : 84.0000000000000000
},
{
"_id" : "a134106e1a1551d296fu777cedc933e7df2d0a9bc5f41de047aba3ee29bace78",
"count" : 106.0000000000000000
},
],
"ok" : 1.0000000000000000
}
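You are prefixing collection with db again: db.collection refers to a collection literally named "collection", not to your some_details collection, so the aggregation runs on an empty collection. Otherwise the code looks fine to me.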
Here is modified version of your code :
from pymongo import MongoClient
client = MongoClient('localhost', 27017)
db = client['somedb']
collection = db.some_details
pipe = [{'$group': {'_id': '$mvid', 'count': {'$sum': 1}}}]
# Notice the below line
TestOutput = collection.aggregate(pipeline=pipe)
print(list(TestOutput))
client.close()
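If you want to sanity-check the numbers without a database, the $group stage with $sum: 1 is essentially a count per distinct mvid. A minimal pure-Python sketch of that counting, using made-up documents in place of the some_details collection:

```python
from collections import Counter

# Hypothetical documents standing in for the some_details collection.
docs = [{"mvid": "a"}, {"mvid": "b"}, {"mvid": "a"}]

# Count documents per mvid, then shape the result like the pipeline's output:
# one {"_id": <mvid>, "count": <n>} entry per distinct value.
counts = Counter(doc["mvid"] for doc in docs)
grouped = [{"_id": mvid, "count": n} for mvid, n in counts.items()]
print(grouped)
```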
I have a JSON file (an export from mongoDB) that I'd like to load into R. The document is about 890 MB in size with roughly 63,000 rows of 12 fields. The fields are numeric, character and date. I'd like to end up with a 63000 x 12 data frame.
lines <- readLines("fb2013.json")
result: jFile has all 63,000 elements in char class and all fields are lumped into one field.
Each line looks something like this:
"{ \"_id\" : \"10151271769737669\", \"comments_count\" : 36, \"created_at\" : { \"$date\" : 1357941938000 }, \"icon\" : \"http://blahblah.gif\", \"likes_count\" : 450, \"link\" : \"http://www.blahblahblah.php\", \"message\" : \"I wish I could figure this out!\", \"page_category\" : \"Computers\", \"page_id\" : \"30968999999\", \"page_name\" : \"NothingButTrouble\", \"type\" : \"photo\", \"updated_at\" : { \"$date\" : 1358210153000 } }"
Using rjson,
jFile <- fromJSON(paste(readLines("fb2013.json"), collapse=""))
only the first row is read into jFile but there are 12 fields.
Using RJSONIO:
jFile <- fromJSON(lines)
results in the following:
Warning messages:
1: In if (is.na(encoding)) return(0L) :
the condition has length > 1 and only the first element will be used
Again, only the first row is read into jFile and there are 12 fields.
The output from rjson and RJSONIO looks something like this:
$`_id`
[1] "1018535"
$comments_count
[1] 0
$created_at
$date
1.357027e+12
$icon
[1] "http://blah.gif"
$likes_count
[1] 20
$link
[1] "http://www.chachacha"
$message
[1] "I'd love to figure this out."
$page_category
[1] "Internet/software"
$page_id
[1] "3924395872345878534"
$page_name
[1] "Not Entirely Hopeless"
$type
[1] "photo"
$updated_at
$date
1.357027e+12
Try this:
library(rjson)
path <- "WHERE/YOUR/JSON/IS/SAVED"
con <- file(path, "r")
l <- readLines(con, -1L)
close(con)
# parse each line as its own JSON document
json <- lapply(X=l, fromJSON)
Since you want a data.frame, try this:
# three copies of your sample...
line.1<- "{ \"_id\" : \"10151271769737669\", \"comments_count\" : 36, \"created_at\" : { \"$date\" : 1357941938000 }, \"icon\" : \"http://blahblah.gif\", \"likes_count\" : 450, \"link\" : \"http://www.blahblahblah.php\", \"message\" : \"I wish I could figure this out!\", \"page_category\" : \"Computers\", \"page_id\" : \"30968999999\", \"page_name\" : \"NothingButTrouble\", \"type\" : \"photo\", \"updated_at\" : { \"$date\" : 1358210153000 } }"
line.2<- "{ \"_id\" : \"10151271769737669\", \"comments_count\" : 36, \"created_at\" : { \"$date\" : 1357941938000 }, \"icon\" : \"http://blahblah.gif\", \"likes_count\" : 450, \"link\" : \"http://www.blahblahblah.php\", \"message\" : \"I wish I could figure this out!\", \"page_category\" : \"Computers\", \"page_id\" : \"30968999999\", \"page_name\" : \"NothingButTrouble\", \"type\" : \"photo\", \"updated_at\" : { \"$date\" : 1358210153000 } }"
line.3<- "{ \"_id\" : \"10151271769737669\", \"comments_count\" : 36, \"created_at\" : { \"$date\" : 1357941938000 }, \"icon\" : \"http://blahblah.gif\", \"likes_count\" : 450, \"link\" : \"http://www.blahblahblah.php\", \"message\" : \"I wish I could figure this out!\", \"page_category\" : \"Computers\", \"page_id\" : \"30968999999\", \"page_name\" : \"NothingButTrouble\", \"type\" : \"photo\", \"updated_at\" : { \"$date\" : 1358210153000 } }"
x <- paste(line.1, line.2, line.3, sep="\n")
lines <- readLines(textConnection(x))
library(rjson)
# this is the important bit
df <- data.frame(do.call(rbind,lapply(lines,fromJSON)))
ncol(df)
# [1] 12
# finally, there's some cleaning up to do...
df$created_at
# [[1]]
# [[1]]$`$date`
# [1] 1.357942e+12
# ...
df$created_at <- as.POSIXlt(unname(unlist(df$created_at)/1000),origin="1970-01-01")
df$created_at
# [1] "2013-01-11 17:05:38 EST" "2013-01-11 17:05:38 EST" "2013-01-11 17:05:38 EST"
df$updated_at <- as.POSIXlt(unname(unlist(df$updated_at)/1000),origin="1970-01-01")
Note that this conversion assumes that the dates were stored as milliseconds since the epoch.
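For comparison, the same line-by-line idea and the millisecond conversion can be sketched in Python with only the standard library. The two input lines are shortened, partly made-up stand-ins for the export, and parse_line is a hypothetical helper name.

```python
import json
from datetime import datetime, timezone

# Two shortened lines in the export's one-document-per-line layout.
lines = [
    '{ "_id" : "10151271769737669", "likes_count" : 450, "created_at" : { "$date" : 1357941938000 } }',
    '{ "_id" : "1018535", "likes_count" : 20, "created_at" : { "$date" : 1358210153000 } }',
]


def parse_line(line):
    """Parse one JSON document and turn its $date (ms since epoch) into a datetime."""
    row = json.loads(line)
    ms = row["created_at"]["$date"]
    row["created_at"] = datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
    return row


rows = [parse_line(line) for line in lines]
print(rows[0]["created_at"].isoformat())  # 2013-01-11T22:05:38+00:00
```

The printed UTC time corresponds to the 17:05:38 EST value shown in the R output above.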