Workaround to add JSON with errors to a MongoDB Atlas collection

In my database class we were given an assignment to work with two JSON files (add them to a MongoDB Atlas collection and query certain results).
Both JSON files had "errors", the first being:
{ "_id" : { "$oid" : "50b59cd75bed76f46522c34e" }, "student_id" : 0, "class_id" : 2, "scores" : [ { "type" : "exam", "score" : 57.92947112575566 }, { "type" : "quiz", "score" : 21.24542588206755 }, { "type" : "homework", "score" : 68.19567810587429 }, { "type" : "homework", "score" : 67.95019716560351 }, { "type" : "homework", "score" : 18.81037253352722 } ] }
and the second being :
{"_id":0,"name":"aimee Zank","scores":[{"score":1.463179736705023,"type":"exam"},{"score":11.78273309957772,"type":"quiz"},{"score":35.8740349954354,"type":"homework"}]},
{"_id":1,"name":"Aurelia Menendez","scores":[{"score":60.06045071030959,"type":"exam"},{"score":52.79790691903873,"type":"quiz"},{"score":71.76133439165544,"type":"homework"}]},
I fixed the first by replacing the $oid key with plain oid, because adding objects with $oid as a key to my collection raised an error. I also needed to put all the objects inside an array.
I fixed the second by putting the entire object inside an array [].
When I asked my professor why these errors were in the JSON files and whether they were there on purpose, he said that they were there for a reason and that we needed to find a "workaround".
I am curious what workaround there is for loading incorrect JSON data into a collection. I am at a complete loss as to what he expected. Is there some way I can just load individual objects line by line from the JSON file into the collection?
This is how I loaded the JSON data after fixing the files directly:
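const fs = require('fs');
// Parse the manually fixed file, which must contain a JSON array of documents.
const data = JSON.parse(fs.readFileSync('./students.json', 'utf8'));
const database = 'college';
const collection = 'students';
use(database);
// Start from an empty collection so that reruns do not insert duplicates.
db.getCollection(collection).drop();
db.createCollection(collection);
db.getCollection(collection).insertMany(data);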
Note: all importing of data had to be done in VS Code, not with mongoimport.
As a side note, this assignment has since passed, so I am not asking for help in completing my homework. I am simply trying to see whether there was something I could have done that would not have required me to edit the JSON file itself. My professor has not responded to me regarding this question.
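One possible workaround, sketched here on the assumption that every document sits on its own line (as in both samples above), is to parse the file line by line instead of as a single JSON value. The helper name parseJsonLines is hypothetical, and the trailing-comma handling is an assumption based on the second sample.

```javascript
// Hypothetical helper: parse text that holds one JSON document per line.
// Assumes no document spans several lines. A trailing comma after a
// document (as in the second sample file) is dropped before parsing.
function parseJsonLines(text) {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim().replace(/,$/, ''))
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line));
}

// Inline stand-in for the file contents (shortened from the samples above).
// In the playground this text would come from fs.readFileSync(path, 'utf8').
const sampleText = [
  '{"_id":0,"name":"aimee Zank","scores":[{"score":1.463179736705023,"type":"exam"}]},',
  '{"_id":1,"name":"Aurelia Menendez","scores":[{"score":60.06045071030959,"type":"exam"}]},',
  '{ "_id" : { "$oid" : "50b59cd75bed76f46522c34e" }, "student_id" : 0, "class_id" : 2 }',
].join('\n');

const parsedDocs = parseJsonLines(sampleText);
console.log(parsedDocs.length, parsedDocs[1].name);
```

The resulting array could then be passed to insertMany as in the code above. In a mongosh-based playground, swapping JSON.parse for EJSON.parse should, as far as I know, turn { "$oid": ... } into a real ObjectId, so the first file would not need editing either; treat that part as an untested assumption.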

Related

handling a well-formed JSON file of an array of objects

A JSON string passes the jsonlint test.
response = [
{
"article" : {
"info" : {
"initial" : {
"articleIds" : [
"7461221587662919569"
],
}
},
"text" : "where they would 'transfer to' next.",
"lang" : "en",
}
},
{
"article" : {
"info" : {
"initial" : {
"articleIds" : [
"6613144915874808065"
],
}
},
"text" : "produto regional.",
"lang" : "pt"
}
}
]
However, after processing
require 'json'
file = File.read('/Users/main/jugg//article_samples.js')
data_hash = JSON.parse(file)
One is left with an array, whereas more commonly a hash with a named key wraps the array, so that one can work with something like response['data'].
But in this case the array is not accessible via response[0]. How can this be treated as an array in order to process each individual element with collection.each do |member|?
A curiosity: data_hash.class => NilClass
The response = ... code from article_samples.js is JavaScript, not JSON. This initializes a variable named response with a JavaScript array.
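To use this as JSON, rename the file to article_samples.json and remove response = from the file. The first line should start with [.
Your second block of code should then work as long as the article_samples.json file is in the correct path. Note that strict JSON does not allow trailing commas, so any that remain in the file (such as the ones after ] and after "en" in the sample) must be removed as well.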
On a side note, I suggest that you find a way to make the path more flexible. The hard-coded path is tied directly to your current machine's file system. This won't work if you want to run this code from another machine, because the folder /Users/main/jugg probably won't exist there.
If this is a web server running Ruby on Rails, then one solution is to create an environment variable with the path where this file is stored.

Restoring a MongoDB collection from a text file of json documents

I have been given a text file containing thousands of JSON documents (not ideal, I know).
I need to put these documents into a MongoDB collection.
So far, I have saved the text file as JSON and tried mongoimport, added commas between each document, and attempted mongorestore with a BSON equivalent, all without success.
Here is an example of what is in the text file:
{
"_id" : ObjectId("78ahgodjaodj90231"),
"date" : ISODate("1970-01-01T00:00:00+0000"),
"comment" : "Hello"
}
{
"_id" : ObjectId("99151gdsgag5464ah"),
"date" : ISODate("1970-01-02T00:00:00+0000"),
"comment" : "World"
}
and so on...
Using mongoimport I get this error message:
Failed: invalid JSON input. Position: 16. Character: O
After saving as a BSON file, using mongorestore I also get this error:
Failed: db.collection: error restoring from file.bson: reading bson input: invalid BSONSize: 537534587 bytes
Any help would be greatly appreciated!
Let's say we have the following data in the file:
{
"_id" : ObjectId("78ahgodjaodj90231"),
"date" : ISODate("1970-01-01T00:00:00+0000"),
"comment" : "Hello"
}
{
"_id" : ObjectId("99151gdsgag5464ah"),
"date" : ISODate("1970-01-02T00:00:00+0000"),
"comment" : "World"
}
We need to wrap the documents in an insertMany call, separate them with commas, and save the result with a .js extension, say insert_data.js:
db.collection.insertMany([
{
"_id" : ObjectId("78ahgodjaodj90231"),
"date" : ISODate("1970-01-01T00:00:00+0000"),
"comment" : "Hello"
},
{
"_id" : ObjectId("99151gdsgag5464ah"),
"date" : ISODate("1970-01-02T00:00:00+0000"),
"comment" : "World"
}
])
Finally run the following command:
mongo HOST:PORT/DB insert_data.js
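With thousands of documents, the refactoring described above could be automated rather than done by hand. The following is a minimal sketch, not a tested import tool: the helper name toInsertManyScript is hypothetical, and it assumes the dump contains only back-to-back documents whose strings are double-quoted.

```javascript
// Hypothetical helper: turn a text dump of back-to-back documents into a
// script containing a single insertMany call. It only tracks braces and
// double-quoted strings, so shell syntax such as ObjectId(...) is kept as-is.
function toInsertManyScript(text, collectionName) {
  const docs = [];
  let depth = 0;
  let inString = false;
  let start = -1;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      if (ch === '\\') i++; // skip the escaped character
      else if (ch === '"') inString = false;
    } else if (ch === '"') {
      inString = true;
    } else if (ch === '{') {
      if (depth === 0) start = i; // a top-level document begins here
      depth++;
    } else if (ch === '}') {
      depth--;
      if (depth === 0) docs.push(text.slice(start, i + 1));
    }
  }
  const target = `db.getCollection(${JSON.stringify(collectionName)})`;
  return `${target}.insertMany([\n${docs.join(',\n')}\n]);\n`;
}

// Inline stand-in for the text file, using the two documents from above.
const dumpText = `{
"_id" : ObjectId("78ahgodjaodj90231"),
"date" : ISODate("1970-01-01T00:00:00+0000"),
"comment" : "Hello"
}
{
"_id" : ObjectId("99151gdsgag5464ah"),
"date" : ISODate("1970-01-02T00:00:00+0000"),
"comment" : "World"
}`;

console.log(toInsertManyScript(dumpText, 'collection'));
```

The returned text could be written to insert_data.js and run with the shell command shown above. Braces inside string values are ignored by the scanner, but anything more exotic (comments, single-quoted strings) is outside what this sketch handles.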
I managed to import the documents successfully using Studio 3T's import feature.
After renaming the text file to a JSON file, and letting Studio 3T validate the JSON before import, it worked perfectly.
Not the best solution, but it seemed to work for me.

tMongoDBOutput - Configuring JSON tree

I am trying to configure the JSON tree for tMongoDBOutput. Only one element is being created in the sub-element array. Can someone please give an example of configuring the JSON tree? The requirement is that one document can have multiple nested sub-documents. medical_records can have multiple sub-documents, but only one sub-document is currently being created and the rest are skipped.
Resulting JSON in MongoDB as follows
{
"first_name" : "testname",
"middle_name" : [],
"last_name" : "test",
"medical_records" : [
{
"dateofuploading" : "2016-09-29 12:49:21.5",
"filename" : "demo.pdf",
"isautogenerated" : "1",
"recordid" : "123"
}
]
}
If you want to have multiple sub-documents in the array, you need to use the group by operation in the advanced settings for the element inside the array.

SyntaxError: Unexpected token )

I am posting this because I have not seen this exact question before, and I have had no luck going through previous posts.
I am creating a layout of an application called Exhibit, one that lays out my data on a timeline. The html code is structured for Exhibit.
My data is stored in a JSON file. I have checked this with JLint and it seems to be in the correct format. Yet I am thrown the above error regarding my JSON file.
Here is one object from my JSON file.
{
"items" : [
{
"url" : "http:\/\/twitter.com\/acarvin\/statuses\/32815014167445504",
"uri" : "file:\/\/\/C:\/Users\/david\/Documents\/Work\/Exhibit\/CAR\/item#%40acarvin%3A%20AlJaz%20showing%20huge%20crowds%20rushing%20down%20a%20Cairo%20street.%20\'It%20is%20an%20intense%20battle%20here.\'%20%23jan25",
"time" : "2011-02-02 14:58:03",
"date" : "2005",
"action" : "reporting",
"hour" : "14:58:03",
"role" : "reporter",
"username" : "acarvin",
"keywords" : [
"crowd",
" battle",
" al jazeera"
],
"ignoretime" : "2\/2\/2011 14:58:03",
"type" : "Item",
"label" : "#acarvin: AlJaz showing huge crowds rushing down a Cairo street. \'It is an intense battle here.\' #jan25",
"gender" : "male",
"location" : "talaat harb",
"origin" : "file:\/\/\/C:\/Users\/david\/Documents\/Work\/Exhibit\/CAR\/hands-on.html#%40acarvin%3A%20AlJaz%20showing%20huge%20crowds%20rushing%20down%20a%20Cairo%20street.%20\'It%20is%20an%20intense%20battle%20here.\'%20%23jan25"
}
]
}
Can anyone see what may be happening?
note: I specified the type of my data as application/json when I called it.
There are several invalid escapes in your strings in the form \'. While those are valid in JavaScript strings (whether single-quoted or double-quoted), they are not valid JSON. In JSON, a ' is just a '.
With those in place, the string will not validate. With the extraneous \ removed, it will. (I used http://jsonlint.com to confirm.)
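If editing the file by hand is impractical, the invalid escapes could also be removed in code before parsing. The sketch below is one way to do that; the helper name stripInvalidQuoteEscapes is hypothetical, and the sample string is shortened from the data above.

```javascript
// Hypothetical helper: drop the backslash from \' sequences, which JSON does
// not allow, while leaving every other escape (\" \\ \/ \n ...) untouched.
function stripInvalidQuoteEscapes(jsonText) {
  // Match a backslash together with the character it escapes, so that an
  // escaped backslash (\\) is consumed as a pair and never mistaken for \'.
  return jsonText.replace(/\\(.)/g, (match, ch) => (ch === "'" ? "'" : match));
}

// String.raw keeps the backslashes exactly as they would appear in the file.
const brokenJson = String.raw`{"url":"http:\/\/twitter.com\/acarvin","label":"\'It is an intense battle here.\' #jan25"}`;

const repaired = JSON.parse(stripInvalidQuoteEscapes(brokenJson));
console.log(repaired.label);
```

Matching the backslash and the following character as a pair is a deliberate choice: a plain search for \' would also corrupt a legitimately escaped backslash that happens to precede an apostrophe.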

Loading Raw JSON into Pig

I have a file where each line is a JSON object (actually, it's a dump of stackoverflow). I would like to load this into Apache Pig as easily as possible, but I am having trouble figuring out how I can tell Pig what the input format is. Here's an example of an entry,
{
"_id" : { "$oid" : "506492073401d91fa7fdffbe" },
"Body" : "....",
"ViewCount" : 7351,
"LastEditorDisplayName" : "Rich B",
"Title" : ".....",
"LastEditorUserId" : 140328,
"LastActivityDate" : { "$date" : 1314819738077 },
"LastEditDate" : { "$date" : 1313882544213 },
"AnswerCount" : 12, "CommentCount" : 19,
"AcceptedAnswerId" : 7,
"Score" : 83,
"PostTypeId" : "question",
"OwnerUserId" : 8,
"Tags" : [ "c#", "winforms" ],
"CreationDate" : { "$date" : 1217540572667 },
"FavoriteCount" : 13, "Id" : 4,
"ForumName" : "stackoverflow.com"
}
Is there a way I can load a file where each line is one of the above into Pig without having to specify the schema by hand? Or perhaps a way to automatically generate a schema based on the (possibly nested) keys observed in all objects? If I do need to specify the schema by hand, what would the schema string look like?
Thanks!
The quick and easy way: use Twitter's elephantbird project. Inside is a loader called com.twitter.elephantbird.pig.load.JsonLoader. When used directly like so,
A = LOAD '/path/to/data.json' USING com.twitter.elephantbird.pig.load.JsonLoader() as (json:map[]);
B = FOREACH A GENERATE json#'fieldName' AS field_name;
nested elements won't be loaded. However, you can easily fix that (if desired) by changing it to
A = LOAD '/path/to/data.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
Including elephantbird is easy: pull the project "elephant-bird" with organization "com.twitter.elephantbird" using Maven (or an equivalent dependency manager), then issue the usual register command in Pig:
register 'lib/elephantbird.jar';