I have a file where each line is a JSON object (it's actually a dump of Stack Overflow). I would like to load this into Apache Pig as easily as possible, but I am having trouble figuring out how to tell Pig what the input format is. Here's an example of an entry:
{
"_id" : { "$oid" : "506492073401d91fa7fdffbe" },
"Body" : "....",
"ViewCount" : 7351,
"LastEditorDisplayName" : "Rich B",
"Title" : ".....",
"LastEditorUserId" : 140328,
"LastActivityDate" : { "$date" : 1314819738077 },
"LastEditDate" : { "$date" : 1313882544213 },
"AnswerCount" : 12, "CommentCount" : 19,
"AcceptedAnswerId" : 7,
"Score" : 83,
"PostTypeId" : "question",
"OwnerUserId" : 8,
"Tags" : [ "c#", "winforms" ],
"CreationDate" : { "$date" : 1217540572667 },
"FavoriteCount" : 13, "Id" : 4,
"ForumName" : "stackoverflow.com"
}
Is there a way I can load a file where each line is one of the above into Pig without having to specify the schema by hand? Or perhaps a way to automatically generate a schema based on the (possibly nested) keys observed in all objects? If I do need to specify the schema by hand, what would the schema string look like?
Thanks!
The quick and easy way: use Twitter's Elephant Bird project. It includes a loader called com.twitter.elephantbird.pig.load.JsonLoader. When used directly like so,
A = LOAD '/path/to/data.json' USING com.twitter.elephantbird.pig.load.JsonLoader() as (json:map[]);
B = FOREACH A GENERATE json#'fieldName' AS field_name;
nested elements won't be loaded. However, you can easily fix that (if desired) by changing it to,
A = LOAD '/path/to/data.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
Including Elephant Bird is easy: pull in the "elephant-bird" project, with organization "com.twitter.elephantbird", using Maven (or an equivalent dependency manager), then issue the usual register command in Pig:
register 'lib/elephantbird.jar';
In my database class we were given an assignment to work with two JSON files (add them to a MongoDB Atlas collection and query certain results).
Both JSON files had "errors", the first being:
{ "_id" : { "$oid" : "50b59cd75bed76f46522c34e" }, "student_id" : 0, "class_id" : 2, "scores" : [ { "type" : "exam", "score" : 57.92947112575566 }, { "type" : "quiz", "score" : 21.24542588206755 }, { "type" : "homework", "score" : 68.19567810587429 }, { "type" : "homework", "score" : 67.95019716560351 }, { "type" : "homework", "score" : 18.81037253352722 } ] }
and the second being:
{"_id":0,"name":"aimee Zank","scores":[{"score":1.463179736705023,"type":"exam"},{"score":11.78273309957772,"type":"quiz"},{"score":35.8740349954354,"type":"homework"}]},
{"_id":1,"name":"Aurelia Menendez","scores":[{"score":60.06045071030959,"type":"exam"},{"score":52.79790691903873,"type":"quiz"},{"score":71.76133439165544,"type":"homework"}]},
I fixed the first error by replacing the $oid key with plain oid, because adding objects that had $oid as a key to my collection raised an error. I also needed to put all the objects into an array.
I fixed the second by putting all the objects inside an array [].
When I asked my professor why these errors were in the JSON files and whether they were there on purpose, he said that they were there for a reason and that we needed to find a "workaround".
I am curious what workaround there is for loading incorrect JSON data into a collection. I am at a complete loss as to what he expected. Is there some way I can load individual objects line by line from the JSON file into the collection?
This is how I loaded the JSON data after fixing the files directly:
const fs = require('fs');
var data = JSON.parse(fs.readFileSync("./students.json"));
JSON.stringify(data);
const database = "college";
const collection = "students";
use(database);
db.students.drop();
db.createCollection(collection);
db.students.insertMany(data);
Note: all the importing of data had to be done in VS Code, not with mongoimport.
As a side note, this assignment has since passed, so I am not asking for help in completing my homework. I am simply trying to see whether there was something I could have done that would not have required me to edit the JSON file itself. My professor has not responded to me regarding this question.
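One possible workaround, offered only as a sketch: read the file as text, split it into lines, strip the trailing comma that the second file has after each object, and parse each line on its own. The helper names parseJsonLines and unwrapOid are hypothetical, and unwrapping { "$oid": ... } into a plain hex string is a simplification (the value ends up as a string rather than an ObjectId).

```javascript
// Reviver for JSON.parse: replaces a wrapper object of the form
// { "$oid": "<hex>" } with the hex string itself, so no key starts with "$".
// Assumption: a plain string _id is acceptable for the assignment.
function unwrapOid(key, value) {
  if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
    const keys = Object.keys(value);
    if (keys.length === 1 && keys[0] === '$oid') {
      return value.$oid;
    }
  }
  return value;
}

// Parses text that holds one JSON object per line.
// Blank lines are skipped and a trailing comma after an object is removed.
function parseJsonLines(text) {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => line.replace(/,$/, ''))
    .map((line) => JSON.parse(line, unwrapOid));
}

// Possible use in the script from the question (same file name as above):
//   const fs = require('fs');
//   const docs = parseJsonLines(fs.readFileSync('./students.json', 'utf8'));
//   db.students.insertMany(docs);
```

This assumes every object sits on exactly one line, as in both samples above; an object spread over several lines would need a different approach.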
I have been given a text file containing thousands of JSON documents (not ideal, I know).
I need to put these documents into a MongoDB collection.
So far I have saved the text file as JSON and tried mongoimport, added commas between the documents, and attempted mongorestore with a BSON equivalent, all without success.
Here is an example of what is in the text file:
{
"_id" : ObjectId("78ahgodjaodj90231"),
"date" : ISODate("1970-01-01T00:00:00+0000"),
"comment" : "Hello"
}
{
"_id" : ObjectId("99151gdsgag5464ah"),
"date" : ISODate("1970-01-02T00:00:00+0000"),
"comment" : "World"
}
and so on...
Using mongoimport I get this error message:
Failed: invalid JSON input. Position: 16. Character: O
After saving as a BSON file, using mongorestore I also get this error:
Failed: db.collection: error restoring from file.bson: reading bson input: invalid BSONSize: 537534587 bytes
Any help would be greatly appreciated!
Let's say we have the following data in the file:
{
"_id" : ObjectId("78ahgodjaodj90231"),
"date" : ISODate("1970-01-01T00:00:00+0000"),
"comment" : "Hello"
}
{
"_id" : ObjectId("99151gdsgag5464ah"),
"date" : ISODate("1970-01-02T00:00:00+0000"),
"comment" : "World"
}
We need to wrap it in an insertMany call, with a comma between the documents, and save it with a .js extension, say insert_data.js:
db.collection.insertMany([
{
"_id" : ObjectId("78ahgodjaodj90231"),
"date" : ISODate("1970-01-01T00:00:00+0000"),
"comment" : "Hello"
},
{
"_id" : ObjectId("99151gdsgag5464ah"),
"date" : ISODate("1970-01-02T00:00:00+0000"),
"comment" : "World"
}
])
Finally run the following command:
mongo HOST:PORT/DB insert_data.js
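With thousands of documents, doing that refactoring by hand is tedious, so it could be scripted. The following is a minimal sketch in Node, not a tested tool: buildInsertScript is a hypothetical helper name, and it assumes that a closing brace followed only by whitespace and an opening brace always marks the boundary between two documents (a string value containing "} {" would break it).

```javascript
// Wraps back-to-back shell-syntax documents in an insertMany call.
// The documents are treated as opaque text, so ObjectId(...) and
// ISODate(...) pass through unchanged for the shell to evaluate.
function buildInsertScript(text, collectionName) {
  // Insert a comma at every document boundary: "}" + whitespace + "{".
  const joined = text.trim().replace(/\}\s*\{/g, '},\n{');
  // JSON.stringify quotes the collection name safely for getCollection().
  return (
    'db.getCollection(' + JSON.stringify(collectionName) + ').insertMany([\n' +
    joined +
    '\n])\n'
  );
}

// Possible use (both file names are placeholders):
//   const fs = require('fs');
//   const text = fs.readFileSync('documents.txt', 'utf8');
//   fs.writeFileSync('insert_data.js', buildInsertScript(text, 'collection'));
```

The generated insert_data.js can then be run with the command shown above.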
I managed to import the documents successfully using Studio 3T's import feature.
After renaming the text file to a JSON file, and letting Studio 3T validate the JSON before import, it worked perfectly.
Not the best solution, but it seemed to work for me.
I've been developing an app for about a year now, so I started it mid-2014 and have been upgrading ember.js and ember-cli as things move forward on those projects. I'm at Ember 1.11 now.
EDIT: Application Adapter
var ApplicationAdapter = DS.RESTAdapter.extend( {
namespace: 'api',
host: null,
setHost: Ember.on('init', function() {
set(this, 'host', this.container._registry.resolve('config:environment').API_ENDPOINT);
})
});
export default ApplicationAdapter;
My JSON API returns a main projects object, along with other sideloaded objects (like projectStatus). What I can't understand is how Ember is able to use the returned JSON, since I don't have any adapters or serializers that specify this format. It looks like this:
{
"projects" : {
"id": 4462875,
"projectName" : "New business from Acme",
"projectDescription" : "Just another great project",
"pmlinks" : [ 1, 2],
"statusLinks" : [ 1440 ],
"commentsLinks" : [ 39 ]
},
"projectResources" : [ {
"id" : 1,
"name" : "Wile E. Coyote"
}, {
"id" : 2,
"name" : "Roadrunner"
}],
"projectComments" : [ {
"id" : 39,
"projectComment" : "started the project"
} ],
"projectStatuses" : [ {
"id" : 1440,
"status" : "G",
"trending" : "N",
"comment" : null,
"lastModifiedDate" : "2015-07-17T13:46:11.037+0000",
"project" : 4462875
} ]
}
I can't find anything in the Ember docs that recommends this "*Links" format for relationships; in fact, they suggest using something more like status_ids. But the example they show doesn't use _ids, so I'm even more confused.
Here's a snippet of my project model:
statusUpdates: DS.hasMany('projectStatus'),
projectComments: DS.hasMany('projectComment'),
projectResources: DS.hasMany('projectResource'),
What I'm trying to figure out is how the API should format the JSON for my new belongsTo relationship to schedule. It seems to work if the project object has a property like "scheduleLinks": [10], but not if it's "schedule": 10 or "schedule_id": 10, which is what the documentation says should work.
EDIT:
Maybe it's because the other objects like projectComments are named the way my model expects, and they're all returned at the same time from one API, that it doesn't even matter what the properties in the projects object are? Are those only used to look up relationships that aren't sideloaded?
The sideloading definitely is the thing here. Because if I change "statusLinks" to "projectStatus" then, since I have that relationship established in the project model, it will try to hit the API for that, e.g. /api/projectstatuses/:id.
So what seems to be happening is that the relationship is being hooked up from the belongsTo side implicitly; if the data is sideloaded, I don't need any links or IDs on the main object.
I am using RockMongo on OpenShift to import a JSON file into a MongoDB database. I exported the JSON directly from another MongoDB instance and haven't changed anything. Here is part of the JSON:
{ "_id" : "10352",
"author" : "8988607",
"country" : "...",
"views" : 1716,
"title" : "...",
"comments" : 1,
"likes" : 28,
"text" : "...",
"date" : { "$date" : 1278070740000 },
"approved" : "8480596" }
And I have this error message:
exception: field names cannot start with $ [$date] at src/mongo/shell/collection.js:147
As I said, I exported the JSON directly from another MongoDB instance. How can I solve this problem?
I ran into this problem, and my DBA replaced the dollar sign with \uFF04, which did the trick for us.
MongoDB uses its own Extended JSON. RockMongo likely uses a standard JSON parser, hence the mismatch.
Can you use the provided mongoimport application? You will need v2.4.0 or greater for it to handle all the extended types; see SERVER-5675.
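If neither option is available, another route is to convert the Extended JSON wrappers yourself before inserting. The following is a rough sketch in plain JavaScript, assuming you can run a script outside RockMongo; reviveExtendedJson and parseExportedDocument are hypothetical names, and only the numeric { "$date": <milliseconds> } form shown in the question is handled.

```javascript
// Reviver for JSON.parse: turns { "$date": <milliseconds> } into a Date,
// so the resulting document no longer has a key starting with "$".
// Other Extended JSON types are left untouched in this sketch.
function reviveExtendedJson(key, value) {
  if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
    const keys = Object.keys(value);
    if (
      keys.length === 1 &&
      keys[0] === '$date' &&
      typeof value.$date === 'number'
    ) {
      return new Date(value.$date);
    }
  }
  return value;
}

// Parses one exported document and converts its $date wrappers.
function parseExportedDocument(jsonText) {
  return JSON.parse(jsonText, reviveExtendedJson);
}
```

The parsed object can then be passed to an insert call from a driver or the shell, which stores Date values as native BSON dates.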
I am doing some JSON loading with WebGL, but my file is a .json, not a .js, and it starts like this:
{
"version" : "0.1.0",
"comment" : "Generated by MeshLab JSON Exporter",
"id" : 1,
"name" : "mesh",
"vertices" :
[
{
"name" : "position_buffer",
"size" : 3,
"type" : "float32",
"normalized" : false,
"values" :
[
-1.88373, -4.96699, -4.80969, -2.09061, -4.88318, -4.81713,
It does not look like the other .js files that I have seen, so I'd like to visualize it in a program like Blender to check whether the problem comes from the file.
But I did not find any program that can open it.
Second, is this file even supported by WebGL's JSON loader?
This isn't simple JSON (like http://learningwebgl.com/lessons/lesson14/Teapot.json); it's an archive with a lot of stuff inside, so you need to write your own parser (or find one).
For JSON loading in general, read http://learningwebgl.com/blog/?p=1658
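As a starting point for such a parser, here is a minimal sketch that pulls one named buffer out of the structure shown in the question. It relies only on the fields visible in that excerpt ("vertices", "name", "size", "type", "values"); the rest of the MeshLab format is not covered, and extractVertexBuffer is a hypothetical helper name.

```javascript
// Finds a buffer such as "position_buffer" in the parsed mesh object and
// returns its component count plus the values as a Float32Array.
function extractVertexBuffer(mesh, bufferName) {
  const entry = mesh.vertices.find((v) => v.name === bufferName);
  if (!entry) {
    throw new Error('No vertex buffer named ' + bufferName);
  }
  // Only the float32 type seen in the question is handled here.
  if (entry.type !== 'float32') {
    throw new Error('Unsupported buffer type: ' + entry.type);
  }
  return { size: entry.size, data: new Float32Array(entry.values) };
}

// Possible use with a WebGL context gl and the parsed file in mesh:
//   const positions = extractVertexBuffer(mesh, 'position_buffer');
//   gl.bindBuffer(gl.ARRAY_BUFFER, gl.createBuffer());
//   gl.bufferData(gl.ARRAY_BUFFER, positions.data, gl.STATIC_DRAW);
//   // positions.data.length / positions.size gives the vertex count
```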
The WebGL JSON loader also opens .js files, which you can generate from a .obj with a Python script such as the one from Three.js (thanks to Mr.doob):
https://github.com/mrdoob/three.js/blob/master/utils/exporters/obj/convert_obj_three.py
The same repository also has a loader for .obj.