JSON streaming with Oboe.js, MongoDB and Express.js

I'm experimenting with JSON streaming through HTTP with Oboe.js, MongoDB and Express.js.
The point is to do a query in MongoDB (using the native Node.js mongodb driver), pipe the result (a JavaScript array) through Express.js and parse it in the browser with Oboe.js.
The benchmarks I did compare streaming vs. blocking, both for the MongoDB query on the server side and for the JSON parsing on the client side.
Here is the source code for the two benchmarks. The first number is the number of milliseconds for 1000 queries of 100 items each (pagination) on a collection of 10 million documents, and the second number, in parentheses, is the number of milliseconds before the very first item in the MongoDB result array is parsed.
The streaming benchmark server-side:
// Oboe.js - 20238 (16.887)
// Native - 16703 (16.69)
var JSONStream = require('JSONStream'); // npm module used to serialize the cursor stream

collection
  .find()
  .skip(+req.query.offset)
  .limit(+req.query.limit)
  .stream()
  .pipe(JSONStream.stringify()) // emits a well-formed JSON array, element by element
  .pipe(res);
The blocking benchmark server-side:
// Oboe.js - 17418 (14.267)
// Native - 13706 (13.698)
collection
  .find()
  .skip(+req.query.offset)
  .limit(+req.query.limit)
  .toArray(function (e, docs) {
    // error handling omitted for the benchmark
    res.json(docs);
  });
These results really surprise me because I would have thought that:
Streaming would be quicker than blocking every single time.
Oboe.js would be quicker to parse the entire JSON array compared to the native JSON.parse method.
Oboe.js would be quicker to parse the first element in the array compared to the native JSON.parse method.
Does anyone have an explanation?
What am I doing wrong?
Here is the source code for the two client-side benchmarks, too.
The streaming benchmark client-side:
var limit = 100;
var max = 1000;
var oboeFirstTimes = [];
var oboeStart = Date.now();

function paginate (i, offset, limit) {
  if (i === max) {
    console.log('> OBOE.js time:', (Date.now() - oboeStart));
    console.log('> OBOE.js avg. first time:', (
      oboeFirstTimes.reduce(function (total, time) {
        return total + time;
      }, 0) / max
    ));
    return true;
  }
  var parseStart = Date.now();
  var first = true;
  oboe('/api/spdy-stream?offset=' + offset + '&limit=' + limit)
    .node('![*]', function () {
      if (first) {
        first = false;
        oboeFirstTimes.push(Date.now() - parseStart);
      }
    })
    .done(function () {
      paginate(i + 1, offset + limit, limit);
    });
}

paginate(0, 0, limit);
The blocking benchmark client-side:
var limit = 100;
var max = 1000;
var nativeFirstTimes = [];
var nativeStart = Date.now();

function paginate (i, offset, limit) {
  if (i === max) {
    console.log('> NATIVE time:', (Date.now() - nativeStart));
    console.log('> NATIVE avg. first time:', (
      nativeFirstTimes.reduce(function (total, time) {
        return total + time;
      }, 0) / max
    ));
    return true;
  }
  var parseStart = Date.now();
  var first = true;
  var req = new XMLHttpRequest();
  req.open('GET', '/api/spdy-stream?offset=' + offset + '&limit=' + limit, true);
  req.onload = function () {
    var json = JSON.parse(req.responseText);
    json.forEach(function () {
      if (first) {
        first = false;
        nativeFirstTimes.push(Date.now() - parseStart);
      }
    });
    paginate(i + 1, offset + limit, limit);
  };
  req.send();
}

paginate(0, 0, limit);
Thanks in advance!

I found these comments in the Oboe.js docs, at the end of the "Why Oboe?" section:
Because it is a pure Javascript parser, Oboe.js requires more CPU time than JSON.parse. Oboe.js works marginally more slowly for small messages that load very quickly but for most real-world cases using i/o effectively beats optimising CPU time.
SAX parsers require less memory than Oboe’s pattern-based parsing model because they do not build up a parse tree. See Oboe.js vs SAX vs DOM.
If in doubt, benchmark, but don’t forget to use the real internet, including mobile, and think about perceptual performance.
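That matches the numbers above: the workload here (1000 sequential requests of 100 small items each) is dominated by per-request overhead and CPU-bound parsing, which favors JSON.parse. If streaming is going to win, it should be on one large response rather than many tiny ones. A hedged, untested sketch against the same endpoint (the 100000-item limit is an arbitrary illustration):
// A fairer test for streaming: one large response instead of 1000 small ones.
var start = Date.now();
var firstLogged = false;
oboe('/api/spdy-stream?offset=0&limit=100000')
  .node('![*]', function (item) {
    if (!firstLogged) {
      firstLogged = true;
      // with streaming, this should fire long before the full body has arrived
      console.log('first item after', Date.now() - start, 'ms');
    }
    // handle each item as it arrives, then drop it to keep memory flat
    return oboe.drop;
  })
  .done(function () {
    console.log('whole array parsed after', Date.now() - start, 'ms');
  });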


Crawling : Web scraping stops due to structural changes

While crawling a webpage, the structure of the page keeps changing; it's dynamic, which leads to a scenario where my crawler stops working. Is there a mechanism to detect structural changes to a webpage before running the full crawler, so as to identify whether the structure has changed or not?
If you can run your own JavaScript code in the webpage, you can use a MutationObserver, which provides the ability to watch for changes being made to the DOM tree.
Something like:
waitForDomStability(timeout: number) {
  return new Promise(resolve => {
    const waitResolve = observer => {
      observer.disconnect();
      resolve();
    };

    let timeoutId;
    const observer = new MutationObserver((mutationList, observer) => {
      for (let i = 0; i < mutationList.length; i += 1) {
        // we only care if new nodes have been added
        if (mutationList[i].type === 'childList') {
          // restart the countdown timer
          window.clearTimeout(timeoutId);
          timeoutId = window.setTimeout(waitResolve, timeout, observer);
          break;
        }
      }
    });

    timeoutId = setTimeout(waitResolve, timeout, observer);

    // start observing document.body
    observer.observe(document.body, { attributes: true, childList: true, subtree: true });
  });
}
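A minimal usage sketch (the `plugin` instance and the 2000 ms stability window are made up for illustration):
// wait until the DOM has been quiet for 2 seconds before scraping
plugin.waitForDomStability(2000).then(() => {
  // no childList mutations for 2s: the page has (probably) settled
  const rows = document.querySelectorAll('table tr');
  console.log('scraping', rows.length, 'rows');
});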
I'm using this approach in the open source scraping extension get-set-fetch. For the full code, look at /packages/background/src/ts/plugins/builtin/FetchPlugin.ts in the repo.
You can certainly use "snapshots" to compare two versions of the same page. I've implemented something similar to Java's String.hashCode to achieve this.
Code in JavaScript:
/*
  returns a DOM element snapshot as an innerText hash code
  starting point is Java's String.hashCode: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
  computed iteratively (Horner's method) as hash = hash*31 + s[i], truncated to 32 bits;
  (hash << 5) - hash is a fast way of writing hash * 31
*/
function getSnapshot() {
  const snapshotSelector = 'body';
  const nodeToBeHashed = document.querySelector(snapshotSelector);
  if (!nodeToBeHashed) return 0;

  const { innerText } = nodeToBeHashed;
  let hash = 0;
  if (innerText.length === 0) {
    return hash;
  }

  for (let i = 0; i < innerText.length; i += 1) {
    // an integer between 0 and 65535 representing the UTF-16 code unit
    const charCode = innerText.charCodeAt(i);
    // multiply by 31 and add the current charCode
    hash = ((hash << 5) - hash) + charCode;
    // truncate to 32 bits, since bitwise operators treat their operands as 32-bit sequences
    hash |= 0;
  }
  return hash;
}
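A sketch of how the snapshot can be used across crawls (persisting the previous hash between runs is an assumption for illustration):
// first crawl: remember the hash
const previousHash = getSnapshot();

// later re-crawl of the same page: compare before running the full crawler
const currentHash = getSnapshot();
if (currentHash !== previousHash) {
  console.warn('page content changed; review the crawler before a full run');
}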
If you can't run JavaScript code in the page, you can use the entire HTML response as the content to be hashed, in your favorite language.
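For example, a minimal Node.js sketch along those lines (the URL is a placeholder; SHA-1 is just one convenient choice for change detection):
// Node.js: hash the raw HTML response to detect changes between crawls
const https = require('https');
const crypto = require('crypto');

https.get('https://example.com/page-to-crawl', res => {
  const hash = crypto.createHash('sha1');
  res.on('data', chunk => hash.update(chunk));
  res.on('end', () => {
    console.log('page hash:', hash.digest('hex'));
  });
});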

getBulkProperties() Hangs and Errors Out

Our application needs to pull a set of properties from all objects in the model, concatenating properties from leaf nodes with properties from their parent nodes.
We are calling the getBulkProperties() method with around 20K nodes and around 5 properties. This runs for quite some time, then we receive server errors and the callbacks are never invoked.
Is there a limit we should use? Should we split these calls with a max number X of nodes?
Any help would be appreciated as this is causing our application to hang.
Thanks!
I don't think there is a limit, but you may consider listing properties for a specific group at a time, or just leaf nodes.
This blog post shows how to optimize the search performance, and the code below (from that post) shows how to integrate it with .getBulkProperties:
viewer.search('Steel',
  function (dbIds) {
    viewer.model.getBulkProperties(dbIds, ['Mass'],
      function (elements) {
        var totalMass = 0;
        for (var i = 0; i < elements.length; i++) {
          totalMass += elements[i].properties[0].displayValue;
        }
        console.log(totalMass);
      });
  }, null, ['Material']);
You may also consider enumerating only the leaf nodes of the model, as shown in this post and below:
function getAllLeafComponents(viewer, callback) {
  var cbCount = 0;     // count pending callbacks
  var components = []; // store the results
  var tree;            // the instance tree

  function getLeafComponentsRec(parent) {
    cbCount++;
    if (tree.getChildCount(parent) != 0) {
      tree.enumNodeChildren(parent, function (children) {
        getLeafComponentsRec(children);
      }, false);
    } else {
      components.push(parent);
    }
    if (--cbCount == 0) callback(components);
  }

  viewer.getObjectTree(function (objectTree) {
    tree = objectTree;
    getLeafComponentsRec(tree.getRootId()); // results are delivered via the callback
  });
}
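To directly address the question about splitting the calls: a hedged sketch of batching getBulkProperties over the leaf nodes, using the same (dbIds, propFilter, onSuccess, onError) calling style as the snippet above. The helper name and the batch size of 1000 are made up; tune the size against your server:
// split a large dbId list into batches and query each batch separately
function getBulkPropertiesInChunks(model, dbIds, propFilter, chunkSize, onDone) {
  var results = [];
  var offset = 0;

  function nextChunk() {
    if (offset >= dbIds.length) return onDone(results);
    var chunk = dbIds.slice(offset, offset + chunkSize);
    offset += chunkSize;
    model.getBulkProperties(chunk, propFilter, function (elements) {
      results = results.concat(elements);
      nextChunk();
    }, function (err) {
      console.error('getBulkProperties failed for a chunk:', err);
      nextChunk(); // skip the failing chunk and keep going
    });
  }

  nextChunk();
}

// usage with the leaf enumeration above; 1000 is an arbitrary batch size
getAllLeafComponents(viewer, function (leafIds) {
  getBulkPropertiesInChunks(viewer.model, leafIds, ['Mass'], 1000, function (all) {
    console.log('got properties for', all.length, 'nodes');
  });
});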

Select specific data in JSON output

I've created a function that does an HTTP request and then saves some data from the JSON output.
$scope.addMovie = function() {
  // Example request:
  // http://api.themoviedb.org/3/movie/206647?api_key=a8f7039633f2065942cd8a28d7cadad4&append_to_response=releases
  // Search for release dates using the ID.
  var base = 'http://api.themoviedb.org/3/movie/';
  var movieID = $(event.currentTarget).parent().find('.movieID').text();
  var apiKey = 'a8f7039633f2065942cd8a28d7cadad4&query=';
  var append_to_response = '&append_to_response=releases';
  var callback = 'JSON_CALLBACK'; // provided by angular.js
  var url = base + movieID + '?api_key=' + apiKey + append_to_response + '&callback=' + callback;

  $http.jsonp(url, { cache: true }).
    success(function(data, status, headers, config) {
      if (status == 200) {
        // $scope.movieListID = data.results;
        $scope.movieListID = data;
        console.log($scope.movieListID);
        createMovie.create({
          title: $scope.movieListID.original_title,
          release_date: $scope.movieListID.release_date,
          image: $scope.movieListID.poster_path
        }).then(init);
      } else {
        console.error('Error happened while getting the movie list.');
      }
    });
};
This function saves the title, release date and poster path, and that works fine. The problem is that it only saves one release_date, while the JSON output has a lot more, but I don't know how to access them.
This is an example of the JSON output I request.
It has a top-level release_date, which is what I save now, but it also has more information:
releases":{
"countries":[
{"certification":"","iso_3166_1":"GB","primary":true,"release_date":"2015-10-26"},
{"certification":"","iso_3166_1":"US","primary":false,"release_date":"2015-11-06"},
{"certification":"","iso_3166_1":"NL","primary":false,"release_date":"2015-11-05"},
{"certification":"","iso_3166_1":"BR","primary":false,"release_date":"2015-11-05"},
{"certification":"","iso_3166_1":"SE","primary":false,"release_date":"2015-11-04"},
{"certification":"","iso_3166_1":"IE","primary":false,"release_date":"2015-10-26"},
How would I go about saving the release date for the NL release?
You just need to iterate through the countries array, and check if the country code matches the one you wish to retrieve. For your example with 'NL':
var releaseNL;
for (var i = 0; i < $scope.movieListID.releases.countries.length; i++) {
  var release = $scope.movieListID.releases.countries[i];
  if (release['iso_3166_1'] == 'NL') {
    releaseNL = release;
  }
}
This is just one of many ways to do this (e.g. you could use angular.forEach, wrap it inside a function, etc.), but this should give you an idea.
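For instance, wrapped in a small reusable function (a sketch; the function name is made up):
// look up the release entry for an arbitrary country code
function getReleaseByCountry(countries, countryCode) {
  for (var i = 0; i < countries.length; i++) {
    if (countries[i]['iso_3166_1'] == countryCode) {
      return countries[i];
    }
  }
  return null;
}

var releaseNL = getReleaseByCountry($scope.movieListID.releases.countries, 'NL');
if (releaseNL) {
  console.log('NL release date:', releaseNL.release_date);
}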
Remark: I noticed you have been asking a lot of very basic questions today, which you could easily answer yourself with a bit more research. E.g. this question is not even AngularJS related, but just a simple JavaScript task. So maybe try to show a bit more initiative next time! ;)

Using Q to return secondary query in node with express and mysql

New to Node. As I am cycling through a roster of students, I need to check whether a teacher has requested them for tutoring.
I realized I can't just do this:
var checkRequest = function (id) {
  var value = '';
  roster.query('SELECT * FROM teacher_request WHERE student_id =' + id, function (err, row) {
    // this runs later, after checkRequest has already returned
    value = row.length;
  });
  return value; // still '' at this point
};
After a bit of digging around, promises looked like a great solution, but if I simply return the deferred.promise from the checkRequest function, all I get is an object that says [deferred promise], and I can't access the actual data from it (or have not figured out how yet).
If I follow along with their API and use .then (as illustrated in the getRow function below), I am back to the same problem I was in before.
function checkRequest(id) {
  console.log(id);
  var deferred = Q.defer();
  connection.query('SELECT * FROM teacher_request WHERE student_id =' + id, function (err, row) {
    if (err) return deferred.reject(err);
    deferred.resolve(row.length);
  });
  return deferred.promise;
}

var getRow = function (id) {
  checkRequest(id).then(function (val) {
    console.log(val); // works great
    return val; // back to the same problem
  });
};
The roster needs to be pulled from an external API, which is why I am not bundling the request check with the original roster query.
Thanks in advance
From the stuff you posted, I assume you have not really understood the concept of promises. They allow you to queue up callbacks that get executed when the asynchronous operation has finished (by succeeding or failing).
So instead of somehow getting the results back into your synchronous workflow, you should convert that workflow to work asynchronously as well. A small example for your current problem:
// your students' ids in here
var studentsArray = [1, 2, 5, 6, 9];

for (var i = 0; i < studentsArray.length; i++) {
  checkRequest(studentsArray[i]) // pass the id, not the loop index
    .then(function (data) {
      console.log(data); // whatever checkRequest resolved with (row.length above)
      // any other code related to a specific student in here
    });
}
or another option, if you need all students' data at the same time:
// your students' ids in here
var studentsArray = [1, 2, 5, 6, 9];

// collect all promises
var reqs = [];
for (var i = 0; i < studentsArray.length; i++) {
  reqs.push(checkRequest(studentsArray[i]));
}

Q.all(reqs)
  .then(function (results) {
    // results is an array with one entry per student, in order
  });
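The same thing reads a bit tighter with map (a sketch reusing the checkRequest from above):
// map each student id to a promise, then wait for all of them
Q.all(studentsArray.map(checkRequest))
  .then(function (counts) {
    studentsArray.forEach(function (id, idx) {
      console.log('student', id, 'has', counts[idx], 'request rows');
    });
  })
  .catch(function (err) {
    console.error('one of the queries failed:', err);
  });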

Choppy/inaudible playback with chunked audio through Web Audio API

I brought this up in my last post, but since it was off topic from the original question I'm posting it separately. I'm having trouble getting my transmitted audio to play back through Web Audio the same way it would sound in a media player. I have tried two different transmission protocols, binaryjs and socketio, and neither makes a difference when playing through Web Audio. To rule out the transport of the audio data as the issue, I created an example that sends the data back to the server after it's received from the client and dumps the result to stdout. Piping that into VLC results in the listening experience you would expect.
To hear the results when playing through VLC, which sounds the way it should, run the example at https://github.com/grkblood13/web-audio-stream/tree/master/vlc using the following command:
$ node webaudio_vlc_svr.js | vlc -
For whatever reason, though, when I try to play this same audio data through Web Audio it fails miserably: all I get is random noise with large gaps of silence in between.
What is wrong with the following code that makes the playback sound so bad?
window.AudioContext = window.AudioContext || window.webkitAudioContext;
var context = new AudioContext();
var delayTime = 0;
var init = 0;
var audioStack = [];

client.on('stream', function (stream, meta) {
  stream.on('data', function (data) {
    context.decodeAudioData(data, function (buffer) {
      audioStack.push(buffer);
      if (audioStack.length > 10 && init == 0) { init++; playBuffer(); }
    }, function (err) {
      console.log("err(decodeAudioData): " + err);
    });
  });
});

function playBuffer() {
  var buffer = audioStack.shift();
  setTimeout(function () {
    var source = context.createBufferSource();
    source.buffer = buffer;
    source.connect(context.destination);
    source.start(context.currentTime);
    // make the next buffer wait the length of the last buffer before being played
    delayTime = source.buffer.duration * 1000;
    playBuffer();
  }, delayTime);
}
Full source: https://github.com/grkblood13/web-audio-stream/tree/master/binaryjs
You really can't just call source.start(audioContext.currentTime) like that.
setTimeout() has a long and imprecise latency - other main-thread stuff can be going on, so your setTimeout() calls can be delayed by milliseconds, even tens of milliseconds (by garbage collection, JS execution, layout...) Your code is trying to immediately play audio - which needs to be started within about 0.02ms accuracy to not glitch - on a timer that has tens of milliseconds of imprecision.
The whole point of the web audio system is that the audio scheduler works in a separate high-priority thread, and you can pre-schedule audio (starts, stops, and audioparam changes) at very high accuracy. You should rewrite your system to:
1) track when the first block was scheduled in audiocontext time - and DON'T schedule the first block immediately, give some latency so your network can hopefully keep up.
2) schedule each successive block received in the future based on its "next block" timing.
e.g. (note I haven't tested this code, this is off the top of my head):
window.AudioContext = window.AudioContext || window.webkitAudioContext;
var context = new AudioContext();
var init = 0;
var audioStack = [];
var nextTime = 0;

client.on('stream', function (stream, meta) {
  stream.on('data', function (data) {
    context.decodeAudioData(data, function (buffer) {
      audioStack.push(buffer);
      // make sure we put at least 10 chunks in the buffer before starting
      if ((init != 0) || (audioStack.length > 10)) {
        init++;
        scheduleBuffers();
      }
    }, function (err) {
      console.log("err(decodeAudioData): " + err);
    });
  });
});

function scheduleBuffers() {
  while (audioStack.length) {
    var buffer = audioStack.shift();
    var source = context.createBufferSource();
    source.buffer = buffer;
    source.connect(context.destination);
    if (nextTime == 0)
      nextTime = context.currentTime + 0.05; // add 50ms latency to work well across systems - tune this if you like
    source.start(nextTime);
    // schedule the next buffer to start right at the end of this one
    nextTime += source.buffer.duration;
  }
}