I wrote a small script that listens on a UDP port and stores every incoming message (one JSON object each) inside a single file. The file starts out containing an empty array in JSON format.
I'm looking for an efficient way to store all (concurrently) incoming messages from multiple clients inside this single file.
The file can grow to several hundred megabytes, so parsing the whole file and appending the new object wouldn't be as efficient as needed.
Do you have an approach?
EDIT
My solution, based on T.J. Crowder's approach:
var dgram = require("dgram");
var fs = require("fs");

var udp_server = dgram.createSocket("udp4");
var udp_server_port = 5000;

udp_server.on("message", function (msg, rinfo) {
    // Wrap the incoming datagram so each entry is a valid JSON object.
    var json_part = "{\"message\": " + msg + "}";

    // Open in r+ mode (read/write without truncating) so we can overwrite
    // the closing bracket of the existing array.
    fs.open('./data/stats.json', 'r+', function (err, fd) {
        if (err) throw err;
        fs.fstat(fd, function (err, stats) {
            if (err) throw err;
            var buf, pos;
            if (stats.size > 2) {
                // File already holds at least one entry: overwrite the
                // final ']' with ',<entry>]'.
                buf = new Buffer(',' + json_part + ']', 'utf-8');
                pos = stats.size - 1;
            } else {
                // File is empty (or just '[]'): write a fresh array.
                buf = new Buffer('[' + json_part + ']', 'utf-8');
                pos = 0;
            }
            fs.write(fd, buf, 0, buf.length, pos, function (err, written, buffer) {
                if (err) throw err;
                fs.close(fd, function () {});
            });
        });
    });
});

udp_server.bind(udp_server_port);
Regards, Marcus
Fundamentally, you'll need to:
1. Open the file using a seekable, writable stream.
2. Seek to the end of it.
3. Back up one character (over the closing ] of the array).
4. Write out a comma (if this isn't the first entry) and the JSON of your new entry.
5. Write a closing ].
6. Close the file.
Looking at the NodeJS docs, it looks like Steps 2-4 (and arguably 5) are done all together, using the position argument of fs.write. (Be sure you open the file using r+, not one of the "append" modes.)
I have an array: [ 'something', 'other' ]
And I want to retrieve only the values for those two ids from Firebase, which contains far more than two items (potentially millions), but if I do this:
var questionRef = new Firebase(fireBaseURL + "/morethanamillionitems/");
questionRef.once('value', function (dataSnapshot) {
    dataSnapshot.forEach(function (childSnapshot) { // Firebase method
        console.log(dataSnapshot.numChildren()); // potentially outputs 1.000.000+
        var uid = childSnapshot.name();
        var childData = childSnapshot.val();
        console.log(uid.indexOf('something'));
        result.push(uid);
    });
});
This basically loads the whole database first, which is not that efficient.
Now I could do:
array.forEach(function (key) {
    var questionRef = new Firebase(fireBaseURL + "/morethanamillionitems/" + key);
    refID = questionRef.val();
    result.push(refID);
});
Or maybe:
questionRef = new Firebase(fireBaseURL + "/morethanamillionitems/");
array.forEach(function (key) {
    if (questionRef.child(key) !== null) {
        refID = questionRef.val();
        result.push(refID);
    }
});
The last one seems the nicest, the previous one seems a bit expensive on the old RAM.
However, I apparently have to call questionRef.once('value', function(){}) each time, hence already loading the whole document-root...
Or am I misunderstanding how Firebase handles these requests? Is the .numChildren() just an answer directly from the server?
Is the .forEach actually remotely executed?
I'm wondering if there is any other way to reduce traffic per request. Which brings me to another question: it seems that Firebase searches locally first but eventually searches remotely, and it's not clear when exactly that happens. Does it periodically check whether something has changed? Or does that only happen when I use .on() and not .once()?
Or am I using the wrong backend for this purpose? Any other suggestions? I tried hood.ie, which is still very beta, and looked at Parse, but Firebase seemed to have the simplicity I need.
(sorry for the sloppy syntax, but you can see what I intended)
[update]
I now have this:
load: function (uids) {
    var FB = new Firebase(URL);
    var self = this; // keep a reference, since 'this' changes inside the callback
    uids.map(function (uid) {
        var currentRef = FB.child(uid + "/_current");
        currentRef.once('value', function (each) {
            var eachVal = each.val();
            if (eachVal !== null) {
                var localSave = {};
                localSave[uid] = eachVal;
                self.saveLocal(localSave);
            } else {
                console.error("Not found: [%s]", uid);
            }
        }, function (err) { });
    });
}
But I'm still wondering when the request actually happens: on .child()? Or in .once()? And if the latter, what is the use of .child() exactly? It seems it's only used for referencing.
Then the second thing: if I want to retrieve an array of a hundred items, would that still mean a hundred separate requests? Or does Firebase have a way of collecting requests and sending them in a batch?
In that last case, .once would be a more 'conservative' option for initial retrieval; later you could attach a .on listener if you need real-time updates.
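To make that distinction concrete, here is a minimal sketch using the legacy Firebase JS API from the question (the URL and the someKey path are placeholders): .child() only builds a reference locally and sends nothing over the wire; the actual read happens when .once() or .on() is attached.
var FB = new Firebase(fireBaseURL); // base ref; no network traffic yet

var ref = FB.child('morethanamillionitems/someKey'); // still just a local reference

// One-shot read: this is where the request actually happens.
ref.once('value', function (snap) {
    console.log('initial value:', snap.val());
});

// Live read: fires once with the current value, then again on every change.
ref.on('value', function (snap) {
    console.log('updated value:', snap.val());
});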
I've started using Gulp JS and must admit I'm finding it really useful.
One of the tasks I need to perform is to zip up a collection of folders into individual zip files, one for each folder, and then zip all of these zip files into one single zip file. Using gulp-zip I've managed to get this far:
var modelFolders = [
    'ELFH_Check',
    'ELFH_DDP',
    'ELFH_Free'
];

gulp.task('zipModels', function () {
    for (var i = 0; i < modelFolders.length; i++) {
        var model = modelFolders[i];
        gulp.src('**/*', {cwd: path.join(process.cwd(), '/built_templates/' + model)})
            .pipe(zip(model + '.zip'))
            .pipe(gulp.dest('./built_templates'));
    }
});
This works and outputs ELFH_Check.zip, ELFH_DDP.zip and ELFH_Free.zip. However, I then need to zip up these zip files into one zip file called "Templates.zip" and I've not managed to get this task to work:
// zip up model files
gulp.task('zipTemplate', ['zipModels'], function () {
    gulp.src('*.zip', {cwd: path.join(process.cwd(), './built_templates/')})
        .pipe(zip('Templates_.zip'))
        .pipe(gulp.dest('./built_templates'));
});
Does anyone know if this is possible or what I'm doing wrong?
I saw the problem as well, and it seems to be related to the cwd option somehow. I'll investigate further.
After @OverZealous' comment, I investigated further and found two issues:
1. As he said, you need to hint gulp to wait until the end of the dependency task (zipModels) by returning a stream from it. As you have multiple streams, you can use event-stream.merge to return a bundled stream.
2. The reason the bundle zip wouldn't work is that your cwd points to /built_templates/, and the second slash causes a problem. To work properly, you need to remove the trailing slash, so it should be path.join(process.cwd(), '/built_templates').
IMPORTANT
Anyway, you should avoid temporary files. Gulp's philosophy is to use pipes and avoid intermediate IO. In that direction, what you want to do is cut the intermediary dest steps, merge the streams, zip them and, finally, output them.
Something like this:
var es = require('event-stream');
var gulp = require('gulp');
var path = require('path');
var zip = require('gulp-zip');

var modelFolders = [
    'ELFH_Check',
    'ELFH_DDP',
    'ELFH_Free'
];

gulp.task('zipModels', function () {
    var zips = [],
        modelZip;

    for (var i = 0; i < modelFolders.length; i++) {
        var model = modelFolders[i];
        modelZip = gulp.src('**/*', {cwd: path.join(process.cwd(), '/built_templates/' + model)})
            .pipe(zip(model + '.zip'));
        // notice we removed the dest step and store the zip stream (still in memory)
        zips.push(modelZip);
    }

    // we finally merge them (the zips), zip them again, and output
    return es.merge.apply(null, zips)
        .pipe(zip('templates.zip'))
        .pipe(gulp.dest('./'));
});
By the name of your folder (built_templates), it seems you have some other task that generates the temporary built files. Preferably, you don't want these either: you should pipe their streams directly to the ZIP stream and, finally, to the bundle-zip stream. That way you'd have a simple stream flow, with one disk read and one disk write at the end, and no temporary files.
If you need them to be different tasks, consider having a function that will generate the stream up to the step before the gulp.dest pipe, and use this function on all subtasks.
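A minimal sketch of that idea, reusing the names from the code above (the helper name is illustrative): the function builds the per-model zip stream and stops before gulp.dest, so every task can keep piping.
// Hypothetical helper: returns the zip stream for one model folder,
// without a gulp.dest step, so callers decide where it ends up.
function zipModelStream(model) {
    return gulp.src('**/*', {cwd: path.join(process.cwd(), '/built_templates/' + model)})
        .pipe(zip(model + '.zip'));
}

// The bundle task reuses the same factory for every model.
gulp.task('zipModels', function () {
    var zips = modelFolders.map(zipModelStream);
    return es.merge.apply(null, zips)
        .pipe(zip('templates.zip'))
        .pipe(gulp.dest('./'));
});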
Additionally, always try to hint your async tasks by returning a stream or a promise, or by receiving a callback function, to signal the end of the task.
I have to obtain some JSON that is embedded inside a script tag on a certain page, so I can't use regular scraping techniques like cheerio.
The easy way out: write the file (download the page) to the server, then read it and use string manipulation to extract the JSON snippets (there are several), work on them, and happily save them to my db.
The thing is that I'm too new to Node.js and can't get the code to work. I think I'm trying to read the file before it is fully written, and if I read it too early I obtain [object Object]...
Here's what I have so far...
var http = require('http');
var fs = require('fs');
var request = require('request');
var localFile = 'tmp/scraped_site_.html';
var url = "siteToBeScraped.com/?searchTerm=foobar"
// writing
var file = fs.createWriteStream(localFile);
var request = http.get(url, function(response) {
response.pipe(file);
});
//reading
var readedInfo = fs.readFileSync(localFile, function (err, content) {
callback(url, localFile);
console.log("READING: " + localFile);
console.log(err);
});
So first of all I think you should understand what went wrong.
The http request operation is asynchronous. This means that the callback code in http.get() will run sometime in the future, but fs.readFileSync, due to its synchronous nature, will execute and complete even before the http request is actually sent to the background thread that will execute it, since they are both invoked in what is commonly known as the same tick. Also, fs.readFileSync returns a value and does not use a callback.
Even if you replace fs.readFileSync with fs.readFile instead the code still might not work properly since the readFile operation might execute before the http response is fully read from the socket and written to the disk.
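A tiny sketch of that ordering problem (the URL is a placeholder): the synchronous line always runs first, because the response callback can only fire on a later tick.
var http = require('http');

http.get('http://example.com/', function (response) {
    console.log('2: response received'); // runs on a later tick
});

console.log('1: this always prints first'); // synchronous code finishes the current tick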
I strongly suggest reading: stackoverflow question and/or Understanding the node.js event loop
The correct place to invoke the file read is when the response stream has finished writing to the file, which would look something like this:
var request = http.get(url, function(response) {
    response.pipe(file);
    file.once('finish', function () {
        fs.readFile(localFile, 'utf8' /* fill encoding here */, function(err, data) {
            // do something with the data if there is no error
        });
    });
});
Of course this is a very raw and not recommended way to write asynchronous code but that is another discussion altogether.
Having said that, if you download a file, write it to disk and then read it all back into memory for manipulation, you may as well forgo the file part and just read the response into a string right away. Your code would then look something like this (this can be implemented in several ways):
var request = http.get(url, function(response) {
    var data = '';

    function read() {
        var chunk;
        while ( chunk = response.read() ) {
            data += chunk;
        }
    }

    response.on('readable', read);
    response.on('end', function () {
        console.log('[%s]', data);
    });
});
What you really should do, IMO, is create a transform stream that extracts just the data you need from the response while not consuming too much memory, yielding this more elegant-looking code:
var request = http.get(url, function(response) {
    response.pipe(yourTransformStream).pipe(file)
});
Implementing this transform stream, however, might prove slightly more complex. So if you're a node beginner and you don't plan on downloading big files or lots of small files, then maybe loading the whole thing into memory and doing string manipulations on it is simpler.
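For illustration only, here is a rough sketch of such a transform stream, using the options-style constructor from newer Node versions. The <script> markers are assumptions, and buffering everything until flush is a simplification: a truly streaming version would scan and emit inside _transform and handle markers split across chunk boundaries.
var Transform = require('stream').Transform;

var extractScript = new Transform({
    transform: function (chunk, encoding, callback) {
        // Simplification: hold everything until the end. A real streaming
        // implementation would scan and emit incrementally here.
        this.buffered = (this.buffered || '') + chunk.toString();
        callback();
    },
    flush: function (callback) {
        var start = this.buffered.indexOf('<script>');
        var end = this.buffered.indexOf('</script>', start);
        if (start !== -1 && end !== -1) {
            // Push only the script body downstream.
            this.push(this.buffered.slice(start + '<script>'.length, end));
        }
        callback();
    }
});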
For further information about transformation streams:
node.js stream api
this wonderful guide by substack
this post from strongloop
Lastly, see if you can use any of the million node.js crawlers already out there :-) take a look at these search results on npm
According to the http module docs, get does not return the response body.
This is modified from the request example on the same page.
What you need to do is process the response within the callback (function) passed into http.request, so it can be called when it is ready (async).
var http = require('http')
var fs = require('fs')

var localFile = 'tmp/scraped_site_.html'
var file = fs.createWriteStream(localFile)

var req = http.request('http://www.google.com.au', function(res) {
    res.pipe(file)
    res.on('end', function(){
        file.end()
        fs.readFile(localFile, function(err, buf){
            console.log(buf.toString())
        })
    })
})

req.on('error', function(e) {
    console.log('problem with request: ' + e.message)
})

req.end();
EDIT
I updated the example to read the file after it is created. This works by having a callback on the end event of the response, which closes the pipe; the file can then be reopened for reading. Alternatively you can use
res.on('data', function(chunk){...})
to process the data as it arrives, without putting it into a temporary file.
My impression is that you're trying to extract a JSON-encoded JS object by reading it from a stream that's downloading an HTML file. This is doable yet hard. It's difficult to know when your search expression has been found, because if you parse the chunks as they come in, you never know whether you received only part of the match: what you're looking for may be split into two or more chunks that are never analyzed as a whole.
You could try something like this:
var http = require('http');

var req = http.request('u/r/l', function (res) {
    res.on('data', function (data) {
        // parse data as it comes in
    });
});
req.end(); // don't forget to actually send the request
This allows you to read the data as it comes in. You can save it to disk or a db, or even parse it, if you accumulate the contents of the script tags into a single string and then parse the objects out of that.
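A sketch of that accumulate-then-parse approach (the URL and the script-tag pattern are placeholders, not from the original page):
var http = require('http');

http.get('http://example.com/page-with-json', function (res) {
    var html = '';
    res.on('data', function (chunk) {
        html += chunk; // accumulate the whole page in memory
    });
    res.on('end', function () {
        // Hypothetical pattern: grab the first script body and try to parse it.
        var match = /<script[^>]*>([\s\S]*?)<\/script>/.exec(html);
        if (match) {
            try {
                var obj = JSON.parse(match[1]);
                console.log(obj);
            } catch (e) {
                console.log('script body was not valid JSON: ' + e.message);
            }
        }
    });
});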
Let me preface this by stating that I am not terribly familiar with ActionScript, so forgive any seemingly obvious things that I may be missing.
I currently have a very simple function within an AS3 application that outputs a file when a button is clicked, using a FileReference object, as seen below:
//Example download event
public function download(event:MouseEvent):void
{
    //Build a simple file to store the current file
    var outputFile:FileReference = new FileReference();

    //Perform a function to build a .wav file from the existing file
    //this returns a ByteArray (buffer)
    downloadBuffer = PrepareAudioFile();

    //Attempt to build the filename (using the length of bytes as the file name)
    var fileName:String = downloadBuffer.length.toString() + ".wav";

    //Save the file
    audioFile.save(downloadBuffer, fileName);
}
There appears to be an error occurring somewhere in here that results in the file not being output at all when I attempt to concatenate the file name as seen above. However, if I replace the fileName variable with a hard-coded option similar to the following, it works just fine:
audioFile.save(downloadBuffer, "Audio.wav");
Ideally, I would love to derive the duration of the file from the length of the ByteArray, using the following:
//Get the duration (in seconds) as it is an audio file encoded in 44.1k
var durationInSeconds:Number = downloadBuffer.length / 44100;
//Grab the minutes and seconds
var m:Number = Math.floor(durationInSeconds / 60);
var s:Number = Math.floor(durationInSeconds % 60);
//Create the file name using those values
audioFile.save(downloadBuffer, m.toString() + "_" + s.toString() + ".wav");
Any ideas would be greatly appreciated.
Where is the problem other than missing the parentheses in m.toString()?
Aren't you missing a .length before the division of downloadBuffer as well?
I was finally able to come up with a viable solution that required explicit typing of all of the variables (including using a separate variable for the .toString() operation), as seen below:
public function download(event:MouseEvent):void
{
    //Build a simple file to store the current file
    var outputFile:FileReference = new FileReference();

    //Perform a function to build a .wav file from the existing file
    //this returns a ByteArray (buffer)
    downloadBuffer = PrepareAudioFile();

    //When accessing the actual length, this needed to be performed separately (and strongly typed)
    var bufferLength:uint = downloadBuffer.length;

    //The string conversion also needed to be stored in a separate variable
    var stringLength:String = bufferLength.toString();

    //Use the variables to properly concatenate a file name
    var fileName:String = stringLength + ".wav";

    //Save the file
    audioFile.save(downloadBuffer, fileName);
}
It's bizarre that these had to be explicitly stored in separate variables and couldn't simply be used in-line as demonstrated in the other examples.
I've implemented a client/server that communicates using a TCP socket. The data that I'm writing to the socket is stringified JSON. Initially everything works as expected; however, as I increase the rate of writes, I eventually encounter JSON parse errors where the client receives the beginning of a new write appended to the end of the old one.
Here is the server code:
var data = {};
data.type = 'req';
data.id = 1;
data.size = 2;
var string = JSON.stringify(data);
client.write(string, callback); // pass the callback itself, not its result
Here is how I am receiving this code on the client server:
client.on('data', function (req) {
    var data = req.toString();
    try {
        json = JSON.parse(data);
    } catch (err) {
        console.log("JSON parse error: " + err);
    }
});
The error that I'm receiving as the rate increases is:
SyntaxError: Unexpected token {
Which appears to be the beginning of the next request being tagged onto the end of the current one.
I've tried using ; as a delimiter on the end of each JSON request and then using:
var data = req.toString().substring(0,req.toString().indexOf(';'));
However, this approach, instead of resulting in JSON parse errors, seems to result in completely missing some requests on the client side as I increase the rate of writes above 300 per second.
Are there any best practices or more efficient ways to delimit incoming requests via TCP sockets?
Thanks!
Thanks everyone for the explanations, they helped me to better understand the way in which data is sent and received via TCP sockets. Below is a brief overview of the code that I used in the end:
var chunk = "";
client.on('data', function(data) {
chunk += data.toString(); // Add string on the end of the variable 'chunk'
d_index = chunk.indexOf(';'); // Find the delimiter
// While loop to keep going until no delimiter can be found
while (d_index > -1) {
try {
string = chunk.substring(0,d_index); // Create string up until the delimiter
json = JSON.parse(string); // Parse the current string
process(json); // Function that does something with the current chunk of valid json.
}
chunk = chunk.substring(d_index+1); // Cuts off the processed chunk
d_index = chunk.indexOf(';'); // Find the new delimiter
}
});
Comments welcome...
You're on the right track with using a delimiter. However, you can't just extract the stuff before the delimiter, process it, and then discard what came after it. You have to buffer up whatever you got after the delimiter and then concatenate what comes next to it. This means that you could end up with any number (including 0) of JSON "chunks" after a given data event.
Basically you keep a buffer, which you initialize to "". On each data event you concatenate whatever you receive to the end of the buffer and then split the buffer on the delimiter. The result will be one or more entries, but the last one might not be complete, so you need to test the buffer to make sure it ends with your delimiter. If not, you pop the last result and set your buffer to it. You then process whatever results remain (which might not be any).
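A minimal sketch of that buffering scheme, reusing the ; delimiter and the process function from the question's code:
var buffer = '';

client.on('data', function (data) {
    buffer += data.toString();
    var parts = buffer.split(';');
    // The last element is either '' (buffer ended on a delimiter) or an
    // incomplete message; either way, keep it as the new buffer.
    buffer = parts.pop();
    parts.forEach(function (part) {
        try {
            process(JSON.parse(part)); // 'process' as in the question's code
        } catch (err) {
            console.log('JSON parse error: ' + err);
        }
    });
});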
Be aware that TCP does not make any guarantees about where it divides the chunks of data you receive. All it guarantees is that all the bytes you send will be received in order, unless the connection fails entirely.
I believe Node data events come in whenever the socket says it has data for you. Technically you could get separate data events for each byte in your JSON data and it would still be within the limits of what the OS is allowed to do. Nobody does that, but your code needs to be written as if it could suddenly start happening at any time to be robust. It's up to you to combine data events and then re-split the data stream along boundaries that make sense to you.
To do that, you need to buffer any data that isn't "complete", including data appended to the end of a chunk of "complete" data. If you're using a delimiter, never throw away any data after the delimiter: always keep it around as a prefix until you see more data and, eventually, either another delimiter or the end event.
Another common choice is to prefix all data with a length field. Say you use a fixed 64-bit binary value. Then you always wait for 8 bytes, plus however many more the value in those bytes indicate, to arrive. Say you had a chunk of ten bytes of data incoming. You might get 2 bytes in one event, then 5, then 4 -- at which point you can parse the length and know you need 7 more, since the last 3 bytes of the third chunk were payload. If the next event actually contains 25 bytes, you'd take the first 7 along with the 3 from before and parse that, and look for another length field in bytes 8-16.
That's a contrived example, but be aware that at low traffic rates, the network layer will generally send your data out in whatever chunks you give it, so this sort of thing only really starts to show up as you increase the load. Once the OS starts building packets from multiple writes at once, it will start splitting on a granularity that is convenient for the network and not for you, and you have to deal with that.
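For illustration, a sketch of that framing (using a 4-byte big-endian length header instead of the 64-bit one described above, and the newer Buffer.alloc/Buffer.from API; handleMessage is a hypothetical handler):
// Sender: prefix each JSON message with its byte length.
function send(socket, obj) {
    var body = Buffer.from(JSON.stringify(obj));
    var header = Buffer.alloc(4);
    header.writeUInt32BE(body.length, 0);
    socket.write(Buffer.concat([header, body]));
}

// Receiver: accumulate bytes and extract complete frames.
var pending = Buffer.alloc(0);
client.on('data', function (data) {
    pending = Buffer.concat([pending, data]);
    while (pending.length >= 4) {
        var size = pending.readUInt32BE(0);
        if (pending.length < 4 + size) break; // wait for the rest of the frame
        var payload = pending.slice(4, 4 + size);
        pending = pending.slice(4 + size);
        handleMessage(JSON.parse(payload.toString())); // hypothetical handler
    }
});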
Following this response:
var chunk = "";
client.on('data', function(data) {
chunk += data.toString(); // Add string on the end of the variable 'chunk'
d_index = chunk.indexOf(';'); // Find the delimiter
// While loop to keep going until no delimiter can be found
while (d_index > -1) {
try {
string = chunk.substring(0,d_index); // Create string up until the delimiter
json = JSON.parse(string); // Parse the current string
process(json); // Function that does something with the current chunk of valid json.
}
chunk = chunk.substring(d_index+1); // Cuts off the processed chunk
d_index = chunk.indexOf(';'); // Find the new delimiter
}
});
I get a problem with the delimiter because ; was part of my sent data.
It is possible to use this update in order to implement a custom delimiter:
var chunk = "";
const DELIMITER = (';;;');
client.on('data', function(data) {
chunk += data.toString(); // Add string on the end of the variable 'chunk'
d_index = chunk.indexOf(DELIMITER); // Find the delimiter
// While loop to keep going until no delimiter can be found
while (d_index > -1) {
try {
string = chunk.substring(0,d_index); // Create string up until the delimiter
json = JSON.parse(string); // Parse the current string
process(json); // Function that does something with the current chunk of valid json.
}
chunk = chunk.substring(d_index+DELIMITER.length); // Cuts off the processed chunk
d_index = chunk.indexOf(DELIMITER); // Find the new delimiter
}
});
I know this question is old but I have an answer for the people still looking at this.
As said in the answers above, the data event will be fired with a nodejs Buffer containing the data received.
res.on('data', function(chunk) {
    // chunk contains the data
})
This next part doesn't seem to be commonly known. The end event is fired when all data is consumed. The close event is fired when the client disconnects.
res.on('end', function() {
    // the response body has been consumed
})
The full code to get the entire body is below
var body = Buffer.from('');

res.on('data', function(chunk) {
    if (chunk && chunk.byteLength > 0) {
        body = Buffer.concat([body, chunk]);
    }
})

res.on('end', function() {
    var data = JSON.parse(body.toString());
    // data contains the response json
})
End event is fired when the data is all consumed: source
close event is fired when the request is closed: source
Try parsing on the end event rather than inside the data handler:
var data = '';

client.on('data', function (chunk) {
    data += chunk.toString();
});

client.on('end', function () {
    data = JSON.parse(data); // use try/catch, because if someone sends you something else for fun, your server can crash
});
Hope this helps.