I'm trying to write streamed JSON data via JSONStream to MongoDB. The stream is needed because the data can get very large (up to tens of GBs), and I would like to use MongoDB's bulk write capability to speed the process up further. To do that, I need to buffer the data, bulk-writing every 1000 JSON objects or so.
My problem is that when I buffer the writing of the data, it does not write all the data and leaves out the last few thousand objects. That is, if I try to write 100000 JSON objects, my code gets to write only 97000 of them. I have tried buffering both MongoDB bulk write and normal write with similar erroneous results.
My code:
var JSONStream = require('JSONStream');
var mongodb = require('mongodb');

// DB connect boilerplate here

var coll = database.collection('Collection');
var bulk = coll.initializeOrderedBulkOp();
var bufferSizeLimit = 1000;
var recordCount = 0;

var jsonStream = JSONStream.parse(['items', true]);

jsonStream.on('data', (data) => {
  bulk.insert(data);
  recordCount++;

  // Write when bulk commands reach buffer size limit
  if (recordCount % bufferSizeLimit == 0) {
    bulk.execute((err, result) => {
      bulk = coll.initializeOrderedBulkOp();
    });
  }
});

jsonStream.on('end', () => {
  // Flush remaining buffered objects to DB
  if (recordCount % bufferSizeLimit != 0) {
    bulk.execute((err, result) => {
      db.close();
    });
  }
});
If I substitute the buffered write code with a simple MongoDB insert, the code works properly. Is there anything I am missing here?
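A likely explanation, for what it's worth: bulk.execute() is asynchronous, so 'data' events keep firing while a batch is being written, and those inserts land on the old bulk object that is about to be replaced, which would explain the dropped records. Below is a minimal sketch of one way around that, pausing the parser until each batch has been flushed; it assumes the JSONStream parser honours pause()/resume() like an ordinary Node stream, and it is untested:

jsonStream.on('data', (data) => {
  bulk.insert(data);
  recordCount++;

  if (recordCount % bufferSizeLimit === 0) {
    // Stop 'data' events while this batch is written, so no inserts
    // end up on the bulk object we are about to replace
    jsonStream.pause();
    bulk.execute((err, result) => {
      if (err) throw err;
      bulk = coll.initializeOrderedBulkOp();
      jsonStream.resume();
    });
  }
});

jsonStream.on('end', () => {
  if (recordCount % bufferSizeLimit !== 0) {
    // Flush the partial batch left over at the end of the stream
    bulk.execute((err, result) => {
      if (err) throw err;
      db.close();
    });
  } else {
    // Nothing left to flush, but the connection still needs closing
    db.close();
  }
});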
I'm trying to import a large JSON file (177k records) into Cloud Firestore. First, I found the code below:
Uploading Code
var admin = require("firebase-admin");
var serviceAccount = require("./service_key.json");

admin.initializeApp({
  credential: admin.credential.cert(serviceAccount),
  databaseURL: "my service key"
});

const firestore = admin.firestore();
const path = require("path");
const fs = require("fs");
const directoryPath = path.join(__dirname, "files");

fs.readdir(directoryPath, function(err, files) {
  if (err) {
    return console.log("Unable to scan directory: " + err);
  }
  files.forEach(function(file) {
    var lastDotIndex = file.lastIndexOf(".");
    var menu = require("./files/" + file);
    menu.forEach(function(obj) {
      firestore
        .collection('academicians2')
        .add({
          'department': obj['department'],
          'designation': obj['designation'],
          'field': obj['field'],
          'name': obj['name'],
          'university': obj['university'],
          'reviewList': [],
          'rating': 0
        })
        .then(function(docRef) {
          console.log("Document written");
        })
        .catch(function(error) {
          console.error("Error adding document: ", error);
        });
    });
  });
});
but after uploading 10-15k records it started giving errors (a memory error, I guess). So I decided to schedule a Cloud Function to run every 1.2 seconds and batch-write the JSON to Firestore, but I really have no idea how to take 499 rows from my JSON on each run.
Scheduled Cloud Function
/* eslint-disable */
const functions = require("firebase-functions");
const admin = require('firebase-admin');
const { user } = require("firebase-functions/lib/providers/auth");

admin.initializeApp();
const firestore = admin.firestore();
const userRef = admin.firestore().collection('academicians2');

exports.scheduledFunction = functions.pubsub.schedule('every 1.2 seconds').onRun((context) => {
  // Do I need to create a for loop for the batch here, or how else should I approach this?
});
I would do something like this:
Make the scheduled function get 500 records at a time with a "start after" clause.
Perform a batch write to the db (batch writes are limited to 500 operations, as you may know).
If successful, copy the last record of those 500 (or a reference to it, e.g. the record's ID) into a document in your db. It can be a document called "upload_tracker" with a field called "last_uploaded".
On subsequent operations: the function queries that last_uploaded record from your db, then performs another operation starting AFTER that last record.
Notes:
- The scheduled function can write multiple batches before terminating if you want to finish quickly.
- In your Google Cloud Console / Cloud Functions, you may want to extend the function's timeout value to 9 minutes if you know it's going to run for a long time.
- The document IDs should reflect your record IDs, if you have them, to make sure there are no duplicates.
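For illustration only, here is a minimal sketch of that approach. It assumes the JSON is deployed with the function as data.json, that progress is tracked in an upload_tracker/progress document with a numeric last_uploaded index, and it uses an 'every 2 minutes' schedule, since Cloud Scheduler intervals are expressed in minutes or longer rather than seconds. All of those names are placeholders, not tested code:

const functions = require('firebase-functions');
const admin = require('firebase-admin');

admin.initializeApp();
const firestore = admin.firestore();

const records = require('./data.json'); // hypothetical file bundled with the function
const BATCH_SIZE = 499;                 // 499 record writes + 1 tracker write = 500, the batch limit

exports.scheduledUpload = functions.pubsub
  .schedule('every 2 minutes')
  .onRun(async () => {
    const trackerRef = firestore.collection('upload_tracker').doc('progress');
    const trackerSnap = await trackerRef.get();
    const start = trackerSnap.exists ? trackerSnap.data().last_uploaded : 0;

    if (start >= records.length) {
      return null; // everything has been uploaded already
    }

    const batch = firestore.batch();
    records.slice(start, start + BATCH_SIZE).forEach((obj, i) => {
      // A deterministic document ID (here, the record's position in the file)
      // keeps re-runs from creating duplicates
      const docRef = firestore.collection('academicians2').doc(String(start + i));
      batch.set(docRef, obj);
    });

    // Record how far we got, so the next run starts after this batch
    batch.set(trackerRef, { last_uploaded: start + BATCH_SIZE }, { merge: true });

    await batch.commit();
    return null;
  });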
I want to convert a csv file into a json file in Node.js, where some of the properties in the json will be arrays. Right now I can read the csv file row by row, which gives me output like this:
{"price":1,"bedrooms":"Bedrooms (All Levels): 4"},
{"price":null,"bedrooms":"Bedrooms (Above Grade): 4"},
{"price":null,"bedrooms":"Master Bedroom: 21x15"},
{"price":null,"bedrooms":"Master Bedroom Bath: Full"},
{"price":null,"bedrooms":"2nd Bedroom: 14x13"},
{"price":null,"bedrooms":"3rd Bedroom: 15x14"},
{"price":null,"bedrooms":"4th Bedroom: 15x12"}
But I want to get something like this:
{"price":1,"bedrooms":["Bedrooms (All Levels): 4","Bedrooms (Above Grade): 4","Master Bedroom: 21x15","Master Bedroom Bath: Full","2nd Bedroom: 14x13","3rd Bedroom: 15x14","4th Bedroom: 15x12"]}
Can someone point out a way to do this? I tried fast-csv, csv-parse, etc., but couldn't merge (push or append) the values of the same field into one field as an array.
Thanks.
The code I have so far:
var fs = require('fs');
var csv = require('fast-csv');

var stream = fs.createReadStream("../../HouseDataDev01.csv");

csv
  .fromStream(stream, {columns: true, ignoreEmpty: true, headers: ["price", "bedrooms"]})
  .on("data", function(data) {
    // console.log(data);
  })
  .on("end", function() {
    console.log("done");
  });
==========
I came up with an idea: maybe I can create an object
var NewHouse = require('../models/NewHouse.js');
// NewHouse is a schema I created before to store the csv data
var test = new NewHouse();
so that I can use the test object something like this:
.on("data", function(data){
for(i in test){
test.i.push(data[index];
}
But I found that test has many other properties, like $__reset, $__dirty and $__setSchema.
How could I write this loop?
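One possibility, if the aim is just to loop over the schema's declared fields rather than the Mongoose document's internal properties ($__reset and friends), is to iterate the schema paths instead of the instance. A rough sketch, assuming the CSV columns share their names with the schema fields and that those fields are declared as arrays (not tested against this schema):

var NewHouse = require('../models/NewHouse.js');

// schema.paths lists only the fields declared in the schema,
// plus the built-in _id and __v, which are skipped here
var fields = Object.keys(NewHouse.schema.paths).filter(function(name) {
  return name !== '_id' && name !== '__v';
});

var test = new NewHouse();

// inside .on("data", function(data) { ... }):
fields.forEach(function(name) {
  if (data[name] !== undefined && data[name] !== "") {
    test[name].push(data[name]); // only works for paths declared as arrays in the schema
  }
});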
OK, let me explain this...
The main point of my solution is to keep something like headtag and fieldnames{} to track the stream from fs, which reads the csv row by row. I use headtag to know which round of the streamed rows I am on. For example, on the first round I need the row's values to become the keys of the objects in my final json file. If I set a header in the fromStream() method, each round's result will contain the header and I don't know how to 'merge' them, so I chose this 'tricky' way.
Then, since in my final json file some (not all) of the fields will be arrays, when I read a second value that is not an empty string "" I convert the field into an array using readResult[fieldnames[i]] = new Array(readResult[fieldnames[i]]);.
Here is the code:
// requires for the modules used below (originally defined elsewhere)
var fs = require('fs');
var csv = require('fast-csv');
var moment = require('moment');
var NewHouse = require('../models/NewHouse.js');

// create a file stream (csvfilepath is defined elsewhere)
var stream = fs.createReadStream(csvfilepath);

// as the file will be read row by row, headtag points to the current row.
// e.g. headtag = 5 means the stream is reading the 5th row of the csv file
var headtag = 0;

// the final read result, in the same format as the schema
var readResult = {};

// fieldnames records the header names from the first row
var fieldnames = {};

csv.fromStream(stream, {ignoreEmpty: true})
  .on("data", function(data) {
    if (headtag === 0) {
      // I assume the first row contains the headers, so make sure the header
      // names are the same as the field names in your schema
      fieldnames = data;
      for (var i = 0; i < data.length; i++) {
        readResult[data[i]] = {};
      }
    }
    if (headtag === 1) {
      // some fields may only contain one value, so save them as a String
      for (var i = 0; i < data.length; i++) {
        readResult[fieldnames[i]] = data[i];
      }
    }
    if (headtag === 2) {
      for (var i = 0; i < data.length; i++) {
        // for fields that may contain multiple values, convert them to an
        // array, then push all further values into it
        if (data[i] !== "") {
          readResult[fieldnames[i]] = new Array(readResult[fieldnames[i]]);
          readResult[fieldnames[i]].push(data[i]);
        }
      }
    }
    if (headtag > 2) {
      for (var i = 0; i < data.length; i++) {
        if (data[i] !== "") {
          readResult[fieldnames[i]].push(data[i]);
        }
      }
    }
    headtag = headtag + 1;
  })
  .on("end", function() {
    // `images` is collected elsewhere (see the linked repo below)
    readResult.images = images;

    // create a time tag
    var startdate = moment().format('MM/DD/YYYY');
    var starttime = moment().format('H:mm:ss');
    readResult.save_date = startdate;
    readResult.save_time = starttime;

    // save the data in mongodb
    NewHouse.create(readResult, function(error, house) {
      if (error) {
        console.log(error);
      } else {
        console.log("successfully created a document in your mongodb collection!");
      }
    });
  });
Based on this question, I have since updated my code. Now you can read both the csv file and images together and save them to mongodb.
For more information, check here:
https://github.com/LarryZhao0616/csv_to_json_converter
I am using the MEAN stack and I am sending query parameters dynamically to my Node.js server endpoints.
My client controller:
$http.get('/api/things', {params: {query: query}}).then(response => {
  this.awesomeThings = response.data;
  socket.syncUpdates('thing', this.awesomeThings);
});
where query is a value injected into the controller.
This is the server controller function (which works):
export function index(req, res) {
  var query = req.query.query && JSON.parse(req.query.query)
  Thing.find(query).sort({_id: -1}).limit(20).execAsync()
    .then(respondWithResult(res))
    .catch(handleError(res));
}
The above works, but I am trying to understand the line
var query = req.query.query && JSON.parse(req.query.query)
as I have never seen this before (and I don't come from a programming background). I console.logged query and understand it's an object (which is what MongoDB requires), but when I console.logged JSON.parse(req.query.query) or JSON.parse(query) to find out the final output, the program stopped working with no error messages, which is very strange.
If someone can explain the above syntax and why it has to be done this way for it to work, that would be much appreciated.
PS: when I try to console.log the JSON.parse result like so, it fails to load, even though it should have no effect whatsoever:
export function index(req, res) {
  var query = req.query.query && JSON.parse(req.query.query)
  var que = JSON.parse(req.query.query)
  Thing.find(query).sort({_id: -1}).limit(20).execAsync()
    .then(respondWithResult(res))
    .catch(handleError(res));
  console.log("que" + que)
}
function one() {
  var x = {};
  var res = JSON.parse(x.y);
  console.log(res);
}

function two() {
  var x = {};
  var res = x.y && JSON.parse(x.y);
  console.log(res);
}
<button onclick="one()">ERROR</button>
<button onclick="two()">NO ERROR</button>
var x = data && JSON.parse(data);
Since the expression is evaluated from the left, data is evaluated first. If it is undefined (or otherwise falsy), the next part, JSON.parse(data), is not performed. On the other hand, if data is defined, the parse is attempted and the result is stored in x.
The main advantage here is that the parse doesn't run if the variable wasn't defined.
It is roughly equivalent to saying:
if (data) { x = JSON.parse(data); }
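Applied to the handler in the question, this is also the likely reason the extra var que = JSON.parse(req.query.query) line makes it fail: when no query parameter is sent, req.query.query is undefined and JSON.parse(undefined) throws, whereas the && version simply leaves query undefined. A guarded version of that logging, purely as an illustration built from the question's own code:

export function index(req, res) {
  var query = req.query.query && JSON.parse(req.query.query);

  // Only parse (and log) when the parameter was actually sent
  if (req.query.query) {
    console.log("que", JSON.parse(req.query.query));
  } else {
    console.log("no query parameter supplied");
  }

  Thing.find(query).sort({_id: -1}).limit(20).execAsync()
    .then(respondWithResult(res))
    .catch(handleError(res));
}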
I'm experimenting with JSON streaming over HTTP with Oboe.js, MongoDB and Express.js.
The point is to do a query in MongoDB (via Node.js's native mongodb driver), pipe the result (a JavaScript array) through Express.js and parse it in the browser with Oboe.js.
The benchmarks I did compare streaming vs. blocking, both for the MongoDB query on the server side and for the JSON parsing on the client side.
Here is the source code for the two server-side benchmarks. The first number is the number of milliseconds for 1000 queries of 100 items (pagination) in a 10 million document collection, and the second number, in parentheses, is the number of milliseconds before the very first item in the MongoDB result array is parsed.
The streaming benchmark server-side:
// Oboe.js - 20238 (16.887)
// Native - 16703 (16.69)
collection
  .find()
  .skip(+req.query.offset)
  .limit(+req.query.limit)
  .stream()
  .pipe(JSONStream.stringify())
  .pipe(res);
The blocking benchmark server-side:
// Oboe.js - 17418 (14.267)
// Native - 13706 (13.698)
collection
  .find()
  .skip(+req.query.offset)
  .limit(+req.query.limit)
  .toArray(function (e, docs) {
    res.json(docs);
  });
These results really surprise me because I would have thought that:
Streaming would be quicker than blocking every single time.
Oboe.js would be quicker to parse the entire JSON array compared to the native JSON.parse method.
Oboe.js would be quicker to parse the first element in the array compared to the native JSON.parse method.
Does anyone have an explanation?
What am I doing wrong ?
Here is the source code for the two client-side benchmarks too.
The streaming benchmark client-side:
var limit = 100;
var max = 1000;
var oboeFirstTimes = [];
var oboeStart = Date.now();
function paginate (i, offset, limit) {
  if (i === max) {
    console.log('> OBOE.js time:', (Date.now() - oboeStart));
    console.log('> OBOE.js avg. first time:', (
      oboeFirstTimes.reduce(function (total, time) {
        return total + time;
      }, 0) / max
    ));
    return true;
  }

  var parseStart = Date.now();
  var first = true;

  oboe('/api/spdy-stream?offset=' + offset + '&limit=' + limit)
    .node('![*]', function () {
      if (first) {
        first = false;
        oboeFirstTimes.push(Date.now() - parseStart);
      }
    })
    .done(function () {
      paginate(i + 1, offset + limit, limit);
    });
}
paginate(0, 0, limit);
The blocking benchmark client-side:
var limit = 100;
var max = 1000;
var nativeFirstTimes = [];
var nativeStart = Date.now();
function paginate (i, offset, limit) {
  if (i === max) {
    console.log('> NATIVE time:', (Date.now() - nativeStart));
    console.log('> NATIVE avg. first time:', (
      nativeFirstTimes.reduce(function (total, time) {
        return total + time;
      }, 0) / max
    ));
    return true;
  }

  var parseStart = Date.now();
  var first = true;

  var req = new XMLHttpRequest();
  req.open('GET', '/api/spdy-stream?offset=' + offset + '&limit=' + limit, true);
  req.onload = function () {
    var json = JSON.parse(req.responseText);
    json.forEach(function () {
      if (first) {
        first = false;
        nativeFirstTimes.push(Date.now() - parseStart);
      }
    });
    paginate(i + 1, offset + limit, limit);
  };
  req.send();
}
paginate(0, 0, limit);
Thanks in advance !
I found these comments in the Oboe docs, at the end of the "Why Oboe?" section:
Because it is a pure Javascript parser, Oboe.js requires more CPU time than JSON.parse. Oboe.js works marginally more slowly for small messages that load very quickly but for most real-world cases using i/o effectively beats optimising CPU time.
SAX parsers require less memory than Oboe’s pattern-based parsing model because they do not build up a parse tree. See Oboe.js vs SAX vs DOM.
If in doubt, benchmark, but don’t forget to use the real internet, including mobile, and think about perceptual performance.
I'm pretty new to coding so forgive me if my code is unreadable or my question simplistic.
I am trying to create a little server application that (amongst other things) displays the properties of a neo4j node. I am using node.js, Express and Aseem Kishore's Node-Neo4j REST API client, the documentation for which can be found here.
My question stems from my inability to fetch the properties of nodes and paths. I can return a node or path, but they seem to be full of objects with which I cannot interact. I pored through the API documents looking for examples of how particular methods are called, but I found nothing.
I've been trying to call the #toJSON method like "db.toJSON(neoNode);", but it tells me that db does not contain that method. I've also tried "var x = neoNode.data", but it returns undefined.
Could someone please help me figure this out?
// This file accepts POST data to the "queryanode" module
// and sends it to "talkToNeo", which queries the neo4j database.
// The results are sent to "resultants", where they are posted to
// a Jade view. Unfortunately, the data comes out looking like
// [object Object], a huge long string, or simply undefined.

var neo4j = require('neo4j');
var db = new neo4j.GraphDatabase('http://localhost:7474');

function resultants(neoNode, res) {
  // if I console.log(neoNode) here, I now get the 4-digit integer
  // that Neo4j uses as a handle for the node.
  console.log("second call of neoNode" + neoNode);
  var alpha = neoNode.data; // this just doesn't work
  console.log("alpha is: " + alpha); // returns undefined
  var beta = JSON.stringify(alpha);
  console.log("logging the node: ");
  console.log(beta); // still undefined
  res.render("results", {path: beta});
  res.end('end');
}

function talkToNeo(reqnode, res) {
  var params = {
  };
  var query = [
    'MATCH (a {xml_id:"' + reqnode + '"})',
    'RETURN (a)'
  ].join('\n');
  console.log(query);
  db.query(query, params, function (err, results) {
    if (err) throw err;
    var neoNode = results.map(function (result) {
      // this returns a long string that looks like an array,
      // but the values cannot be fetched out of it
      return result['a'];
    });
    console.log("this is the value of neoNode");
    console.log(neoNode);
    resultants(neoNode, res);
  });
}

exports.queryanode = function (req, res) {
  console.log('queryanode called');
  if (req.method == 'POST') {
    // this works as it should; the neo4j query gets the right value passed in
    var reqnode = req.body.node;
    talkToNeo(reqnode, res);
  }
}
EDIT
Hey, I just wanted to answer my own question for anybody googling node, neo4j, data, or "How do I get neo4j properties?"
The gigantic object from neo4j, the one that when you stringify it is full of "http://localhost:7474/db/data/node/7056/whatever" URLs everywhere, is JSON. You can query it with its own notation. You can set a variable to the value of a property like this:
var alpha = unfilteredResult[0]["nodes(p)"][i]._data.data;
Dealing with this JSON can be difficult. If you're anything like me, the object is way more complex than any internet example prepares you for. You can see its structure by putting it through a JSON viewer, but the important thing is that sometimes there's an extra, unnamed top layer to the object. That's why we access the zeroth layer with square bracket notation, as in unfilteredResult[0]. The rest of the line mixes square bracket and dot notation, but it works. This is the final code for a function that calculates the shortest path between two nodes and loops through it. The final variables are passed into a Jade view.
function talkToNeo(nodeone, nodetwo, res) {
  var params = {
  };
  var query = [
    'MATCH (a {xml_id:"' + nodeone + '"}),(b {xml_id:"' + nodetwo + '"}),',
    'p = shortestPath((a)-[*..15]-(b))',
    'RETURN nodes(p), p'
  ].join('\n');
  console.log("logging the query " + query);
  db.query(query, params, function (err, results) {
    if (err) throw err;
    var unfilteredResult = results;
    var neoPath = "Here are all the nodes that make up this path: ";
    for (var i = 0; i < unfilteredResult[0]["nodes(p)"].length; i++) {
      neoPath += JSON.stringify(unfilteredResult[0]['nodes(p)'][i]._data.data);
    }
    var pathLength = unfilteredResult[0].p._length;
    console.log("final result: " + neoPath);
    res.render("results", {path: neoPath, pathLength: pathLength});
    res.end('end');
  });
}
I would recommend that you look at the sample application, which we updated for Neo4j 2.0. It uses Cypher to load the data and node labels to model the JavaScript types.
You can find it here: https://github.com/neo4j-contrib/node-neo4j-template
Please ask more questions after looking at this.