How to write and immediately read a file in Node.js - JSON

I have to obtain a JSON object that is embedded inside a script tag on a certain page, so I can't use regular scraping techniques like cheerio.
The easy way out: write the file (download the page) to the server, then read it and use string manipulation to extract the JSON objects (there are several), work on them, and save them to my db happily.
The thing is that I'm too new to Node.js and can't get the code to work. I think I'm trying to read the file before it is fully written, and if I read it too soon I obtain [object Object]...
Here's what I have so far...
var http = require('http');
var fs = require('fs');
var request = require('request');

var localFile = 'tmp/scraped_site_.html';
var url = "siteToBeScraped.com/?searchTerm=foobar";

// writing
var file = fs.createWriteStream(localFile);
var request = http.get(url, function(response) {
    response.pipe(file);
});

// reading
var readedInfo = fs.readFileSync(localFile, function (err, content) {
    callback(url, localFile);
    console.log("READING: " + localFile);
    console.log(err);
});

So first of all I think you should understand what went wrong.
The HTTP request operation is asynchronous. This means that the callback code in http.get() will run sometime in the future, but fs.readFileSync, due to its synchronous nature, will execute and complete before the HTTP request has even been sent, since both are invoked in what is commonly known as the same tick. Also, fs.readFileSync returns a value and does not take a callback.
Even if you replace fs.readFileSync with fs.readFile, the code still might not work properly, since the readFile operation might execute before the HTTP response has been fully read from the socket and written to disk.
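To make the ordering concrete, here is a toy sketch (not part of the original answer) showing that synchronous code in the same tick always runs before the GET callback:

var http = require('http');

http.get('http://example.com/', function (response) {
    // runs on a later tick, only once a response actually arrives
    console.log('second: response received');
});
console.log('first: this synchronous line runs immediately');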
I strongly suggest reading this Stack Overflow question and/or Understanding the node.js event loop.
The correct place to invoke the file read is when the response stream has finished writing to the file, which would look something like this:
var request = http.get(url, function(response) {
    response.pipe(file);
    file.once('finish', function () {
        fs.readFile(localFile, 'utf8' /* or whichever encoding fits */, function(err, data) {
            // do something with the data if there is no error
        });
    });
});
Of course, this is a very raw and not recommended way to write asynchronous code, but that is another discussion altogether.
Having said that, if you download a file, write it to disk and then read it all back into memory for manipulation, you might as well forgo the file part and just read the response into a string right away. Your code would then look something like this (this can be implemented in several ways):
var request = http.get(url, function(response) {
    var data = '';

    function read() {
        var chunk;
        // drain whatever is currently available in the internal buffer
        while ( chunk = response.read() ) {
            data += chunk;
        }
    }

    response.on('readable', read);
    response.on('end', function () {
        console.log('[%s]', data);
    });
});
What you really should do, IMO, is create a transform stream that extracts just the data you need from the response while not consuming too much memory, yielding this more elegant-looking code:
var request = http.get(url, function(response) {
    response.pipe(yourTransformStream).pipe(file);
});
Implementing this transform stream, however, might prove slightly more complex, as sketched below. So if you're a node beginner and you don't plan on downloading big files or lots of small files, then maybe loading the whole thing into memory and doing string manipulations on it is simpler.
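For illustration only, a minimal sketch of such a yourTransformStream, assuming the JSON sits between hypothetical script-tag markers (the marker strings are assumptions, not known values from the page):

var Transform = require('stream').Transform;

// hypothetical markers; adjust to the real page structure
var START = '<script id="data">';
var END = '</script>';

var yourTransformStream = new Transform();
var buffered = '';

yourTransformStream._transform = function (chunk, encoding, done) {
    buffered += chunk.toString();
    var start = buffered.indexOf(START);
    var end = buffered.indexOf(END, start + START.length);
    if (start !== -1 && end !== -1) {
        // a complete script block was seen; push only the JSON inside it
        this.push(buffered.slice(start + START.length, end));
        buffered = buffered.slice(end + END.length);
    }
    done();
};

Note this naive version keeps everything it has seen in memory until a match appears, so a production version would cap or trim the buffer.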
For further information about transformation streams:
node.js stream api
this wonderful guide by substack
this post from strongloop
Lastly, see if you can use any of the million node.js crawlers already out there :-) Take a look at these search results on npm.

According to the http module documentation, 'get' does not return the response body.
This is modified from the request example on the same page.
What you need to do is process the response within the callback (function) passed into http.request, so it can be called when it is ready (async).
var http = require('http');
var fs = require('fs');

var localFile = 'tmp/scraped_site_.html';
var file = fs.createWriteStream(localFile);

var req = http.request('http://www.google.com.au', function(res) {
    res.pipe(file);
    // wait for the write stream to flush everything to disk
    // before reopening the file for reading
    file.on('finish', function() {
        fs.readFile(localFile, function(err, buf) {
            console.log(buf.toString());
        });
    });
});

req.on('error', function(e) {
    console.log('problem with request: ' + e.message);
});

req.end();
EDIT
I updated the example to read the file after it is created. This works by waiting for the write stream's finish event, which fires once the piped response has been fully flushed to disk, at which point the file can safely be reopened for reading. Alternatively you can use
res.on('data', function(chunk){...})
to process the data as it arrives, without putting it into a temporary file (note that it is the response, not the request, that emits 'data' events).

My impression is that you're extracting a JSON-serialized js object by reading a stream that's downloading a file containing HTML. This is doable yet hard. It's difficult to know when your search expression has been found, because if you parse the chunks as they come in, you never know whether you received only part of it: what you're looking for may have been split across two or more chunks that were never analyzed as a whole.
You could try something like this:
http.request('u/r/l', function(res) {
    res.on('data', function(data) {
        // parse data as it comes in
    });
}).end();
This allows you to read the data as it comes in. You can save it to disk or a db as it arrives, or parse it, provided you accumulate the contents of the script tags into a single string and then parse the objects out of that.
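As a concrete sketch of that accumulate-then-parse idea (the marker strings are assumptions about the page, not known values):

var http = require('http');

http.request('u/r/l', function(res) {
    var html = '';
    res.on('data', function(chunk) {
        // accumulate; a chunk boundary may split the JSON anywhere
        html += chunk;
    });
    res.on('end', function() {
        // hypothetical markers around the embedded JSON
        var marker = '<script id="data">';
        var start = html.indexOf(marker) + marker.length;
        var end = html.indexOf('</script>', start);
        var obj = JSON.parse(html.slice(start, end));
        console.log(obj);
    });
}).end();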

Related

How to parse json newline delimited in Angular 2

I am writing an Angular 2 app (built with angular cli) and trying to use the AWS Polly text-to-speech API.
According to the API you can request audio output as well as "Speech Marks", which describe word timing, visemes, etc. The audio is delivered in "mp3" format, and the speech marks as "application/x-json-stream", which I understand to be newline-delimited JSON. It cannot be parsed with JSON.parse() due to the new lines. I have so far been unable to read/parse this data. I have looked at several libs for "json streaming", but they are all built for node.js and won't work with Angular 2. My code is as follows...
onClick() {
    AWS.config.region = 'us-west-2';
    AWS.config.accessKeyId = 'xxxxx';
    AWS.config.secretAccessKey = 'yyyyy';

    let polly = new AWS.Polly();
    var params = {
        OutputFormat: 'json',
        Text: 'Hello world',
        VoiceId: 'Joanna',
        SpeechMarkTypes: ['viseme']
    };

    polly.synthesizeSpeech(params, (err, data) => {
        if (err) {
            console.log(err, err.stack);
        } else {
            var uInt8Array = new Uint8Array(data.AudioStream);
            var arrayBuffer = uInt8Array.buffer;
            var blob = new Blob([arrayBuffer]);
            var url = URL.createObjectURL(blob);
            this.audio.src = url;
            this.audio.play(); // works fine
            // speech marks info displays "application/x-json-stream"
            console.log(data.ContentType);
        }
    });
}
Strangely enough Chrome browser knows how to read this data and displays it in the response.
Any help would be greatly appreciated.
I had the same problem. I saved the file so I could then read it line by line, accessing the JSON objects when I need to highlight the words being read. Mind you, this is probably not the most efficient way, but it is an easy way to move on and get working on the fun stuff.
I am trying out different ways to work with Polly and will update this answer if I find a better way; one hand-rolled approach is sketched below.
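For reference, a sketch of parsing the newline-delimited speech marks by hand (assuming the bytes arrive in data.AudioStream; the helper name is hypothetical):

// decode the buffer to text, then parse each non-empty line as JSON
function parseSpeechMarks(audioStream) {
    var text = new TextDecoder('utf-8').decode(new Uint8Array(audioStream));
    return text.split('\n')
        .filter(function(line) { return line.trim().length > 0; })
        .map(function(line) { return JSON.parse(line); });
}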
You can do it with:
https://www.npmjs.com/package/ndjson-parse
That worked for me.
But I can't play the audio; I tried your code and it says:
DOMException: Failed to load because no supported source was found.

Raspberry PI server - GPIO ports status JSON response

I've been struggling for a couple of days. The question is simple: is there a way I can create a server on a Raspberry Pi that will return the current status of the GPIO ports in JSON format?
Example:
http://192.168.1.109:3000/led

{
    "Name": "green led",
    "Status": "on"
}
I found the Adafruit gpio-stream library useful, but I don't know how to format the data as JSON.
Thank you
There are a variety of libraries for GPIO interaction in node.js. One issue is that you might need to run as root to have access to GPIO, unless you can adjust the read access for those devices. This is supposed to be fixed in the latest version of Raspbian, though.
I recently built a node.js application that was triggered by a motion sensor in order to activate the screen (and deactivate it after a period of time). I tried various GPIO libraries, but the one that I ended up using was "onoff" https://www.npmjs.com/package/onoff, mainly because it seemed to use an appropriate way to identify changes on the GPIO pins (using interrupts).
Now, you say that you want to send data, but you don't specify how that is supposed to happen. If we use the example that you want to send data using a POST request via HTTP, with the JSON as the body, that would mean you would initialize the GPIO pins you have connected and then attach event handlers for them (to listen for changes).
Upon a change, you would invoke the HTTP request, serializing the JSON from a javascript object (there are libraries that take care of this as well). You would need to keep a name reference yourself, since you only address the GPIO pins by number.
Example:
var GPIO = require('onoff').Gpio;
var request = require('request');

var x = new GPIO(4, 'in', 'both');

function exit() {
    x.unexport();
}

x.watch(function (err, value) {
    if (err) {
        console.error(err);
        return;
    }
    request({
        uri: 'http://example.org/',
        method: 'POST',
        json: true,
        body: { x: value } // This is the actual JSON data that you are sending
    }, function () {
        // this is the callback from when the request is finished
    });
});
process.on('SIGINT', exit);
I'm using the npm modules onoff and request; request is used to simplify the JSON serialization over an HTTP request.
As you can see, I only set up one GPIO pin here. If you need to track multiple pins, you must make sure to initialize them all, distinguish them by some sort of name, and remember to unexport them in the exit callback. I'm not sure what happens if you don't, but you might lock them for other processes.
Thank you, this was very helpful. I did not express myself well, sorry for that. I don't want to send data (for now); I just want to enter a web address like 192.168.1.109/led and receive a JSON response. This is what I've managed to do so far. I don't know if this is the right way. Please, can you review this or suggest a better method...
var http = require('http');
var url = require('url');
var Gpio = require('onoff').Gpio;

var led = new Gpio(23, 'out');

http.createServer(function (req, res) {
    res.writeHead(200, {'Content-Type': 'text/html'});
    var command = url.parse(req.url).pathname.slice(1);
    switch (command) {
        case "on":
            //led.writeSync(1);
            var x = led.readSync();
            res.write(JSON.stringify({ msgId: x }));
            //res.end("It's ON");
            res.end();
            break;
        case "off":
            led.writeSync(0);
            res.end("It's OFF");
            break;
        default:
            res.end('Hello? yes, this is pi!');
    }
}).listen(8080);
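Not from the original thread, but as a review note: for a pure status endpoint like /led you could return the JSON with the proper content type. A minimal sketch under the same onoff setup (pin number and name are assumptions):

var http = require('http');
var url = require('url');
var Gpio = require('onoff').Gpio;

var led = new Gpio(23, 'out');

http.createServer(function (req, res) {
    if (url.parse(req.url).pathname === '/led') {
        // report the pin state as JSON, matching the question's example
        res.writeHead(200, { 'Content-Type': 'application/json' });
        res.end(JSON.stringify({
            Name: 'green led',
            Status: led.readSync() === 1 ? 'on' : 'off'
        }));
    } else {
        res.writeHead(404);
        res.end();
    }
}).listen(3000);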

Accessing indexedDB in ServiceWorker. Race condition

There aren't many examples demonstrating indexedDB in a ServiceWorker yet, but the ones I saw were all structured like this:
const request = indexedDB.open( 'myDB', 1 );
var db;
request.onupgradeneeded = ...
request.onsuccess = function() {
    db = this.result; // Average 8ms
};

self.onfetch = function(e)
{
    const requestURL = new URL( e.request.url ),
          path = requestURL.pathname;

    if( path === '/test' )
    {
        const response = new Promise( function( resolve )
        {
            console.log( performance.now(), typeof db ); // Average 15ms
            db.transaction( 'cache' ).objectStore( 'cache' ).get( 'test' ).onsuccess = function()
            {
                resolve( new Response( this.result, { headers: { 'content-type':'text/plain' } } ) );
            }
        });
        e.respondWith( response );
    }
}
Is this likely to fail when the ServiceWorker starts up, and if so what is a robust way of accessing indexedDB in a ServiceWorker?
Opening the IDB every time the ServiceWorker starts up is unlikely to be optimal; you'll end up opening it even when it isn't used. Instead, open the db when you need it. A singleton is really useful here (see https://github.com/jakearchibald/svgomg/blob/master/src/js/utils/storage.js#L5), so you don't need to open IDB twice if it's used twice in its lifetime; a sketch of the pattern follows.
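A minimal sketch of that lazy-singleton pattern (illustrative names, not taken from the linked source):

var dbPromise;

function getDB() {
    // open on first use; every later call reuses the same promise
    if (!dbPromise) {
        dbPromise = new Promise(function (resolve, reject) {
            var request = indexedDB.open('myDB', 1);
            request.onupgradeneeded = function () { /* create object stores here */ };
            request.onsuccess = function () { resolve(request.result); };
            request.onerror = function () { reject(request.error); };
        });
    }
    return dbPromise;
}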
The "activate" event is a great place to open IDB and let any "onupdateneeded" events run, as the old version of ServiceWorker is out of the way.
You can wrap a transaction in a promise like so:
var tx = db.transaction(scope, mode);
var p = new Promise(function(resolve, reject) {
    tx.onabort = function() { reject(tx.error); };
    tx.oncomplete = function() { resolve(); };
});
Now p will resolve/reject when the transaction completes/aborts. So you can do arbitrary logic in the tx transaction, then p.then(...) and/or pass a dependent promise into e.respondWith() or e.waitUntil(), etc. Combined with the singleton above, the fetch handler might look like the sketch below.
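Putting the pieces together, a sketch of the question's fetch handler using the hypothetical getDB() from above:

self.onfetch = function (e) {
    var requestURL = new URL(e.request.url);
    if (requestURL.pathname === '/test') {
        e.respondWith(getDB().then(function (db) {
            return new Promise(function (resolve, reject) {
                var tx = db.transaction('cache');
                var req = tx.objectStore('cache').get('test');
                tx.onabort = function () { reject(tx.error); };
                req.onsuccess = function () {
                    resolve(new Response(req.result, { headers: { 'content-type': 'text/plain' } }));
                };
            });
        }));
    }
};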
As noted by other commenters, we really do need to promisify IndexedDB. But the composition of its post-task autocommit model and the microtask queues that Promises use makes it... nontrivial to do so without basically completely replacing the API. But (as an implementer and one of the spec editors) I'm actively prototyping some ideas.
I don't know of anything special about accessing IndexedDB from the context of a service worker versus accessing it via a controlled page.
Promises obviously make your life much easier within a service worker, so I've found using something like, e.g., https://gist.github.com/inexorabletash/c8069c042b734519680c to be useful instead of the raw IndexedDB API. But it's not mandatory, as long as you create and manage your own promises to reflect the state of the asynchronous IndexedDB operations.
The main thing to keep in mind when writing a fetch event handler (and this isn't specific to using IndexedDB), is that if you call event.respondWith(), you need to pass in either a Response object or a promise that resolves with a Response object. As long as you're doing that, it shouldn't matter whether your Response is constructed from IndexedDB entries or the Cache API or elsewhere.
Are you running into any actual problems with the code you posted, or was this more of a theoretical question?

fs.readstream to read an object and then pipe to a writable file?

Currently I have a module pulling SQL results like this:
[{ID: 'test', NAME: 'stack'},{ID: 'test2', NAME: 'stack'}]
I want to literally have that written to a file so I can read it back as an object later, but I want to write it via a stream, because some of the objects are really, really huge and keeping them in memory isn't working anymore.
I am using mssql https://www.npmjs.org/package/mssql
and I am stuck here:
request.on('recordset', function(result) {
    console.log(result);
});
How do I stream this out to a writable stream? I see options for object mode, but I can't seem to figure out how to set it.
request.on('recordset', function(result) {
    var readable = fs.createReadStream(result),
        writable = fs.createWriteStream("loadedreports/bot" + x[6]);
    readable.pipe(writable);
});
This just errors because createReadStream must be given a file path...
Am I on the right track here, or do I need to do something else?
You're almost on the right track: you just don't need a readable stream, since your data already arrives in chunks.
Then you can just create the writable stream OUTSIDE of the actual 'recordset' event; otherwise you would create a new stream every time you get a new chunk (and that is not what you want).
Try it like this:
var writable = fs.createWriteStream("loadedreports/bot" + x[6]);

request.on('recordset', function(result) {
    // streams accept strings or buffers, so serialize the recordset first
    writable.write(JSON.stringify(result));
});
EDIT
If the recordset is already too big, use the row event instead:
request.on('row', function(row) {
    // Same here
});
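For instance, a sketch of the row-by-row variant (same assumed mssql streaming request as above; treat the details, including the 'done' event, as assumptions to verify against the mssql docs):

var writable = fs.createWriteStream('loadedreports/bot' + x[6]);

request.on('row', function(row) {
    // one JSON object per line, so the full result set never sits in memory
    writable.write(JSON.stringify(row) + '\n');
});

request.on('done', function() {
    // close the file once the result set has been fully streamed
    writable.end();
});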

Efficient way to append a JSON object without parsing the file

I wrote a small script that listens on a UDP port and stores all incoming messages (one JSON object each) in a single file. The initially empty file contains an array in JSON format.
I'm looking for an efficient way to store all (concurrently) incoming messages from multiple clients inside this single file.
The file's size can grow to several hundred megabytes, so parsing the whole file and appending the new object would not be efficient enough.
Do you have an approach?
EDIT
My solution, based on T.J. Crowder's approach:
var dgram = require("dgram");
var fs = require("fs");

var udp_server = dgram.createSocket("udp4");
var udp_server_port = 5000;

udp_server.on("message", function (msg, rinfo) {
    var json_part = "{\"message\": " + msg + "}";
    fs.open('./data/stats.json', 'r+', function (err, fd) {
        if (err) throw err;
        fs.fstat(fd, function (err, stats) {
            if (err) throw err;
            if (stats.size > 2) {
                // file already holds entries: prepend a comma and
                // overwrite the trailing ']' in place
                json_part = new Buffer(',' + json_part + ']', 'utf-8');
                var pos = parseInt(stats.size) - 1;
            } else {
                // first entry: write the whole array from position 0
                json_part = new Buffer('[' + json_part + ']', 'utf-8');
                var pos = 0;
            }
            fs.write(fd, json_part, 0, json_part.length, pos, function (err, written, buffer) {
                if (err) throw err;
                fs.close(fd, function () {
                });
            });
        });
    });
});

udp_server.bind(udp_server_port);
Regards, Marcus
Fundamentally, you'll need to:
1. Open the file using a seekable, writable stream.
2. Seek to the end of it.
3. Back up one character (over the closing ] of the array).
4. Write out a comma (if this isn't the first entry) and the JSON of your new entry.
5. Write a closing ].
6. Close the file.
Looking at the NodeJS docs, it looks like Steps 2-4 (and arguably 5) are done all together, using the position argument of fs.write. (Be sure you open the file using r+, not one of the "append" modes.)