How to open a .mongo file and export the content into csv? - json

EDIT 2014-05-01: I tried fromJSON first (as suggested below), but that only parsed the first line. I found out that there were commas missing between the brackets of each JSON line so I changed that in TextEdit and saved the file. I also added [ at the beginning of the file and ] at the end and then it worked with JSON. Now the next step: from a list (with embedded lists) to a dataframe (or csv).
I get a data package from edX every now and then on the courses we are evaluating. Some of these are just plain .csv files which are quite easy to handle, others are more difficult for me (not having a CS or programming background).
I have 2 files I want to open and parse into csv files for analysis in R. I have tried many many json2csv tools out there, but to no avail. I also tried the simple methods described here to turn json into csv.
The data is confidential, so I cannot share the entire data set, but will share the first two lines of the file, maybe that helps. The problem is that nowhere I find anything about .mongo files, which to me seems quite strange, do they even exist? Or is this just a JSON file that may be corrupted (which could explain the errors)?
Any suggestions are welcome.
The first 2 lines in one of the .mongo files:
{
"_id": {
"$oid": "52d1e62c350e7a3156000009"
},
"votes": {
"up": [
],
"down": [
],
"up_count": 0,
"down_count": 0,
"count": 0,
"point": 0
},
"visible": true,
"abuse_flaggers": [
],
"historical_abuse_flaggers": [
],
"parent_ids": [
],
"at_position_list": [
],
"body": "the delft university accredited course with the scholarship (fundamentals of water treatment) is supposed to start in about a month's time. But have the scholarship list been published? Any tentative date??",
"course_id": "DelftX/CTB3365x/2013_Fall",
"_type": "Comment",
"endorsed": false,
"anonymous": false,
"anonymous_to_peers": false,
"author_id": "269835",
"comment_thread_id": {
"$oid": "52cd40c5ab40cf347e00008d"
},
"author_username": "tachak59",
"sk": "52d1e62c350e7a3156000009",
"updated_at": {
"$date": 1389487660636
},
"created_at": {
"$date": 1389487660636
}
}{
"_id": {
"$oid": "52d0a66bcb3eee318d000012"
},
"votes": {
"up": [
],
"down": [
],
"up_count": 0,
"down_count": 0,
"count": 0,
"point": 0
},
"visible": true,
"abuse_flaggers": [
],
"historical_abuse_flaggers": [
],
"parent_ids": [
{
"$oid": "52c63278100c07c0d1000028"
}
],
"at_position_list": [
],
"body": "I got it. Thank you!",
"course_id": "DelftX/CTB3365x/2013_Fall",
"_type": "Comment",
"endorsed": false,
"anonymous": false,
"anonymous_to_peers": false,
"parent_id": {
"$oid": "52c63278100c07c0d1000028"
},
"author_id": "2655027",
"comment_thread_id": {
"$oid": "52c4f303b03c4aba51000013"
},
"author_username": "dmoronta",
"sk": "52c63278100c07c0d1000028-52d0a66bcb3eee318d000012",
"updated_at": {
"$date": 1389405803386
},
"created_at": {
"$date": 1389405803386
}
}{
"_id": {
"$oid": "52ceea0cada002b72c000059"
},
"votes": {
"up": [
],
"down": [
],
"up_count": 0,
"down_count": 0,
"count": 0,
"point": 0
},
"visible": true,
"abuse_flaggers": [
],
"historical_abuse_flaggers": [
],
"parent_ids": [
{
"$oid": "5287e8d5906c42f5aa000013"
}
],
"at_position_list": [
],
"body": "if u please send by mail \n",
"course_id": "DelftX/CTB3365x/2013_Fall",
"_type": "Comment",
"endorsed": false,
"anonymous": false,
"anonymous_to_peers": false,
"parent_id": {
"$oid": "5287e8d5906c42f5aa000013"
},
"author_id": "2276302",
"comment_thread_id": {
"$oid": "528674d784179607d0000011"
},
"author_username": "totah1993",
"sk": "5287e8d5906c42f5aa000013-52ceea0cada002b72c000059",
"updated_at": {
"$date": 1389292044203
},
"created_at": {
"$date": 1389292044203
}
}

R doesn't have "native" support for these files but there is a JSON parser with the rjson package. So I might load my .mongo file with:
myfile <- "path/to/myfile.mongo"
myJSON <- readLines(myfile)
myNiceData <- fromJSON(myJSON)
Since RJson converts into a data structure that fits the object being read, you'll have to do some additional snooping but once you have an R data type you shouldn't have any trouble working with it from there.
Another package to consider when parsing JSON data is jsonlite. It will make data frames for you so you can write them to a csv format with write.table or some other applicable method for writing objects.
NOTE: if it is easier to connect to the MongoDB and get the data from a request, then RMongo may be a good bet. The R-Bloggers also made a post about using RMongo that has a nice little walkthrough.

I used RJSON as suggested by #theWanderer and with the help of a colleague wrote the following code to parse the data into columns, choosing the specific columns that are needed, and checking each of the instances if they return the right variables.
Entire workflow:
Checked some of the data in jsonlint - corrected the errors → },{ instead of }{ between each line and [ and ] at the beginning and end of the file
Made a smaller file to play with, containing about 11 JSON lines
Used the code below to parse the datafile - however, checking the different listItems first if they are not lists themselves (that gives problems) // as you will see, I also removed things like \n because that gave errors and added an empty value for parent_id if there is none in the data (otherwise it would mix up the data)
The code to import the .mongo file into R and then parse it into CSV:
library(rjson)
###### set working directory to write out the data file
setwd("/your/favourite/dir/json to csv/")
#never ever convert strings to factors
options(stringsAsFactors = FALSE)
#import the .mongo file to R
temp.data = fromJSON(file="temp.mongo", method="C", unexpected.escape="error")
file.remove("temp.csv") ## removes the old datafile if there is one
## (so the data is not appended to the file,
## but a new file is created)
listItem = temp.data[[1]] ## prepare the listItem the first time
for (listItem in temp.data){
parent_id = ""
if (length(listItem$parent_id)>0){
parent_id = listItem$parent_id
}
write.table(t(c(
listItem$votes$up_count, listItem$visible, parent_id,
gsub("\n", "", listItem$body), listItem$course_id, unlist(listItem["_type"]),
listItem$endorsed, listItem$anonymous, listItem$author_id,
unlist(listItem$comment_thread_id), listItem$author_username,
as.POSIXct(unlist(listItem$created_at)/1000, origin="1970-01-01"))), # end t(), c()
file="temp.csv", sep="\t", append=TRUE, row.names=FALSE, col.names=FALSE)
}

Related

JMESPath how to write a query with multi-level filter?

I have been studying official documentation of JMESPath and a few other resources. However I was not successful with the following task:
my data structure is a json from vimeo api (video list):
data array contains lots of objects, each object is the uploaded file that has many attributes and various options.
"data": [
{
"uri": "/videos/00001",
"name": "Video will be added.mp4",
"description": null,
"type": "video",
"link": "https://vimeo.com/00001",
"duration": 9,
"files":[
{
"quality": "hd",
"type": "video/mp4",
"width": 1440,
"height": 1440,
"link": "https://player.vimeo.com/external/4443333.sd.mp4",
"created_time": "2020-09-01T19:10:01+00:00",
"fps": 30,
"size": 10807854,
"md5": "643d9f18e0a63e0630da4ad85eecc7cb",
"public_name": "UHD 1440p",
"size_short": "10.31MB"
},
{
"quality": "sd",
"type": "video/mp4",
"width": 540,
"height": 540,
"link": "https://player.vimeo.com/external/44444444.sd.mp4",
"created_time": "2020-09-01T19:10:01+00:00",
"fps": 30,
"size": 1345793,
"md5": "cb568939bb7b276eb468d9474c1f63f6",
"public_name": "SD 540p",
"size_short": "1.28MB"
},
... other data
]
},
... other uploaded files
]
Filter I need to apply is that duration needs to be less than 10 and width of file needs to be 540 and the result needs to contain a link (url) from files
I have managed to get only one of structure-levels working:
data[].files[?width == '540'].link
I need to extract this kind of list
[
{
"uri": "/videos/111111",
"link": "https://player.vimeo.com/external/4123112312.sd.mp4"
},
{
"uri": "/videos/22222",
"link": "https://player.vimeo.com/external/1231231231.sd.mp4"
},
...other data
]
Since the duration is in your data array, you will have to add this filter at that level.
You will also have to use what is described under the section filtering and selecting nested data because you only care of one specific type of file under the files array, so, you can use the same type of query structure | [0] in order to pull only the first element of the filtered files array.
So on your reduced exemple, the query:
data[?duration < `10`].{ uri: uri, link: files[?width == `540`].link | [0] }
Would yield the expected:
[
{
"uri": "/videos/00001",
"link": "https://player.vimeo.com/external/44444444.sd.mp4"
}
]

Python error: Extra data: line 1 in loading a big Json file

I am trying to read a JSON file which is 370 MB
import json
data = open( "data.json" ,"r")
json.loads(data.read())
and it's not possible to easily find the root cause of the following error,
json.decoder.JSONDecodeError: Extra data: line 1 column 1024109 (char 1024108)
I looked at similar questions and tried the following StackOverflow answer
import json
data = [json.loads(line) for line in open('data.json', 'r')]
But it didn't resolve the issue. I am wondering if there is any solution to find where the error happens in the file. I am getting some other files from the same source and they run without any problem.
A small piece of the Json file is a list of dicts like,
{
"uri": "p",
"source": {
"uri": "dail",
"dataType": "pr",
"title": "Daily"
},
"authors": [
{
"type": "author",
"isAgency": false
}
],
"concepts": [
{
"amb": false,
"imp": true,
"date": "2019-05-23",
"textStart": 2459,
"textEnd": 2467
},
{
"amb": false,
"imp": true,
"date": "2019-05-09",
"textStart": 2684,
"textEnd": 2691
}
],
"shares": {},
"wgt": 100,
"relevance": 100
}
The problem with json library is loaded everything to memory and parsed in full and then handled in-memory, which for such a large amount of data is clearly problematic.
Instead I would suggest to take a look at https://github.com/henu/bigjson
import bigjson
with open('data.json', 'rb') as f:
json_data = bigjson.load(f)

Why is JSON being detected as empty?

So I have a JSON file I got from Postman which is returning as an empty object. This is how I'm reading it.
import regscooter from './json_files/reginald_griffin_scooter.json'
const scoot = regscooter;
const CustomerPage = () => {...}
reginald_griffin_scooter.json
{
"success": true,
"result": {
"id": "hhhhhhhhhh",
"model": "V1 Scooter",
"name": "hhhhhhhhhh",
"status": "active",
"availabilityStatus": "not-available",
"availabilityTrackingOn": true,
"serial": "hhhhhhhhhhhh",
"createdByUser": "hhhhhhhhK",
"createdByUsername": "hhhhhhhh",
"subAssets": [
"F0lOjWBAnG"
],
"parts": [
"hhhhhhhh"
],
"assignedCustomers": [
"hhhhhhhhh"
],
"createdAt": "2019-12-03T21:47:26.218Z",
"updatedAt": "2020-06-26T22:05:54.526Z",
"customFieldsAsset": [
{
"id": "hhhhhhh",
"name": "MAC",
"value": "hhhhhhhh",
"asset": "hhhhhhhhhh",
"user": "hhhhhhhhh",
"createdAt": "2019-12-03T21:47:26.342Z",
"updatedAt": "2019-12-11T16:29:24.732Z"
},
{
"id": "hhhhhhhh",
"name": "IMEI",
"value": "hhhhhhh",
"asset": "hhhhhhh",
"user": "hhhhhhhhhh",
"createdAt": "2019-12-03T21:47:26.342Z",
"updatedAt": "2019-12-11T16:29:24.834Z"
},
{
"id": "hhhhhhhhh",
"name": "Key Number",
"value": "NA",
"asset": "hhhhhhhhh",
"user": "hhhhhhhhhhh",
"createdAt": "2019-12-03T21:47:26.342Z",
"updatedAt": "2019-12-11T16:29:24.911Z"
}
]
}
}
The error is that "const scoot" is being shown as an empty object {}. I made sure to save a ton of times everywhere. I am able to read through the imported JSON file in other variables in similar ways, so I don't know why I can't parse this one. I just want to access the JSON object inside this. Also I omitted some information with hhhhh because of confidentiality.
EDIT: The code works, but it still has a red line beneath result when I do:
const scoot = regscooter.result.id;
It would be much more effective if you will provide an example in codesandbox or so.
However at first look it might be a parser issue ( maybe you are using Webpack with missing configuration for parsing files with json extension ), meaning we need more information to provide you with a full answer ( maybe solution ? ).
Have you tried to do the next:
const scoot = require('./json_files/reginald_griffin_scooter.json');

Read first line of huge Json file with Spark using Pyspark

I'm pretty new to Spark and to teach myself I have been using small json files, which work perfectly. I'm using Pyspark with Spark 2.2.1 However I don't get how to read in a single data line instead of the entire json file. I have been looking for documentation on this but it seems pretty scarce. I have to process a single large (larger than my RAM) json file (wikipedia dump: https://archive.org/details/wikidata-json-20150316) and want to do this in chuncks or line by line. I thought Spark was designed to do just that but can't find out how to do it and when I request the top 5 observations in a naive way I run out of memory. I have tried RDD .
SparkRDD= spark.read.json("largejson.json").rdd
SparkRDD.take(5)
and Dataframe
SparkDF= spark.read.json("largejson.json")
SparkDF.show(5,truncate = False)
So in short:
1) How do I read in just a fraction of a large JSON file? (Show first 5 entries)
2) How do I filter a large JSON file line by line to keep just the required results?
Also: I don't want to predefine the datascheme for this to work.
I must be overlooking something.
Thanks
Edit: With some help I have gotten a look at the first observation but it by itself is already too huge to post here so I'll just put a fraction of it here.
[
{
"id": "Q1",
"type": "item",
"aliases": {
"pl": [{
"language": "pl",
"value": "kosmos"
}, {
"language": "pl",
"value": "\\u015bwiat"
}, {
"language": "pl",
"value": "natura"
}, {
"language": "pl",
"value": "uniwersum"
}],
"en": [{
"language": "en",
"value": "cosmos"
}, {
"language": "en",
"value": "The Universe"
}, {
"language": "en",
"value": "Space"
}],
...etc
That's very similar to Select only first line from files under a directory in pyspark
Hence something like this should work :
def read_firstline(filename):
with open(filename, 'rb') as f:
return f.readline()
# files is a list of filenames
rdd_of_firstlines = sc.parallelize(files).flatMap(read_firstline)

Reading JSON file content in LINUX

Dears,
Can someone help me out reading the content of JSON file in LINUX machine without using JQ, Python & Ruby. Looking for purely SHELL scripting. We need to iterate the values if multiple records found. In the below case we have 2 set of records, which needs to iterated.
{
"version": [
"sessionrestore",1
],
"windows": [
{
"tabs": [
{
"entries": [
{
"url": "http://orf.at/#/stories/2.../" (http://orf.at/#/stories/2.../%27) ,
"title": "news.ORF.at",
"charset": "UTF-8",
"ID": 9588,
"docshellID": 298,
"docIdentifier": 10062,
"persist": true
},
{
"url": "http://oracle.at/#/stories/2.../" (http://oracle.at/#/stories/2.../%27) ,
"title": "news.at",
"charset": "UTF-8",
"ID": 9589,
"docshellID": 288,
"docIdentifier": 00062,
"persist": false
}
]
}
}
}