Ruby: parsing huge JSON Lines from a download - json

Ruby 3.1.0
I am trying to parse JSON Lines without blowing up memory. My routine prints nothing, and I am wondering where I am going wrong. I open a tempfile to hold the huge file, which I suspect is mistake #1, but I don't know how else to structure this. I then try to copy the huge file from Google to my tempfile and step through it one line at a time. I get nothing... I am perplexed.
Oh. I figured it out. copy_stream leaves the file at EOF. I just had to rewind it before using it.
require "tempfile"
require "open-uri"
require "json"
url = "https://storage.googleapis.com/somehugefile.jsonl"
inventory_file = Tempfile.new
inventory_file.binmode
uri = URI(url)
IO.copy_stream(uri.open, inventory_file)
f = File.foreach(inventory_file)
f.each_entry {|line| puts JSON.parse(line) }

It was simple. I did not know that the copy_stream method left the file pointer at the end of the file. So I just had to rewind it, and it all worked as expected.
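For what it's worth, the explicit tempfile may not even be needed. A sketch (same URL, untested): open-uri already buffers large downloads to a tempfile of its own, so you can iterate the response directly, one line at a time:

require "open-uri"
require "json"

# Stream straight from the URL; open-uri spools big responses to its own
# tempfile, and each_line hands back one JSON record at a time.
URI.open("https://storage.googleapis.com/somehugefile.jsonl") do |io|
  io.each_line { |line| puts JSON.parse(line) }
end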


Extracting the outputs/results from an executed .pexe file

My goal is to convert a C++ program into a .pexe file in order to execute it later on a remote computer. The .pexe file will contain some mathematical formulas or functions to be calculated on the remote computer, so I'll basically be using the computational power of the remote computer. For all this I'll be using the nacl_sdk with the Pepper library, and I would be grateful if someone could clarify some things for me:
1. Is it possible to save the outputs of the executed .pexe file on the remote computer into a file? If so, how, and which file formats are supported?
2. Is it possible to send the outputs of the executed .pexe file on the remote computer back to the host computer automatically? If so, how?
3. Do I have to install anything on the remote computer for this to work?
Any suggestion will be appreciated.
From what I've tried, it seems you can't capture what your pexe writes to stdout - it just goes to the stdout of the browser. (It took me hours to realize that it does go somewhere; I followed a bad tutorial that had me believe the pexe's stdout would be posted to the JavaScript side, and I was wondering why it "did nothing".)
I am currently porting my own stuff to .pexe as well, and it has turned out to be quite simple, but that has to do with the way I write my programs:
I write my (C++) programs such that all code-parts read inputs only from an std::istream object and write their outputs to some std::ostream object. Then I just pass std::cin and std::cout to the top-level call and can use the program interactively in the shell. But then I can easily swap out the top-level call to use an std::ifstream and std::ofstream to use the program for batch-processing (without pipes from cat and redirecting to files, which can be troublesome under some circumstances).
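A minimal sketch of that structure (toplevelCall and the line processing are placeholders of mine, not from the question):

#include <fstream>
#include <iostream>
#include <string>

// All real work reads from an istream and writes to an ostream.
void toplevelCall(std::istream& in, std::ostream& out) {
    std::string line;
    while (std::getline(in, line))
        out << "processed: " << line << '\n';
}

int main() {
    // Interactive use in a shell:
    toplevelCall(std::cin, std::cout);
    // Batch processing is the same call with file streams swapped in:
    // std::ifstream in("input.txt");
    // std::ofstream out("output.txt");
    // toplevelCall(in, out);
}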
Since I write my programs like that, I can just implement the message handler like this:
class foo : public pp::Instance {
  // ... ctor, dtor, ...
  virtual void HandleMessage(const pp::Var& msg) override {
    std::stringstream i, o;
    i << msg.AsString();   // the browser message becomes the program's input stream
    toplevelCall(i, o);    // run the program against the two stringstreams
    PostMessage(o.str());  // send the collected output back to the browser
  }
};
so the data I get from the browser is put into a stringstream, which the rest of the code can use for input. It gets another stringstream where the rest of the code writes its output, and I then just send that output back to the browser. (The downside is that you have to wait for the program to finish before you see any result. You could derive a class from std::ostream and have its << operator post to the browser directly - NaCl should come with a class that does that, but I don't know whether it actually does.)
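Something along those lines might look like this (a sketch only - PostingBuf is my own made-up name; pp::Instance::PostMessage and pp::Var are the same PPAPI calls used above):

#include <sstream>
#include <string>
#include "ppapi/cpp/instance.h"
#include "ppapi/cpp/var.h"

// A streambuf that posts everything flushed into it back to the browser.
class PostingBuf : public std::stringbuf {
 public:
  explicit PostingBuf(pp::Instance* inst) : inst_(inst) {}

 protected:
  int sync() override {                  // runs on std::flush / std::endl
    inst_->PostMessage(pp::Var(str()));  // send the buffered text
    str("");                             // then clear the buffer
    return 0;
  }

 private:
  pp::Instance* inst_;
};

// Usage inside HandleMessage: PostingBuf buf(this); std::ostream out(&buf);
// toplevelCall(i, out); - now every flush shows up in the browser immediately.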
On the HTML/JS side, you can then have a textarea and a pre (which I like to call stdin and stdout ;-) ), plus a button that posts the content of the textarea to the pexe, and an event handler that writes the messages from the pexe to the pre, like this:
<embed id='pnacl' type='application/x-pnacl' src='manifest.nmf' width='0' height='0'/>
<textarea id="stdin">Type your input here...</textarea>
<pre id='stdout' width='80' height='25'></pre>
<script>
var pnacl = document.getElementById('pnacl');
var stdout = document.getElementById('stdout');
var stdin = document.getElementById('stdin');
pnacl.addEventListener('message', function(ev){stdout.textContent += ev.data;});
</script>
<button onclick="pnacl.postMessage(stdin.value);">Submit</button>
Congratulations! Your program now runs in the browser!
I am not done porting my compilers yet, but it looks like this would even work for code that uses flex & bison (you only have to copy FlexLexer.h into the include directory of the pnacl sdk and ignore the warnings about the "register" storage class specifier :-)
Are you using the .pexe in a browser? That's the usual case.
I recommend using nacl_io to emulate POSIX in the browser (also look at file_io). This will allow you to save files locally and retrieve them, in any format you fancy.
To send the output use the browser's usual capabilities such as XMLHttpRequest. You need PNaCl to talk to JavaScript for this, you may want to look at some of the examples.
A regular web server will do, it really depends on what you're doing.
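A sketch of that send step (the /results endpoint is made up for illustration):

var pnacl = document.getElementById('pnacl');
pnacl.addEventListener('message', function (ev) {
  // Forward whatever the pexe computed to your web server.
  var xhr = new XMLHttpRequest();
  xhr.open('POST', '/results'); // hypothetical endpoint on your server
  xhr.setRequestHeader('Content-Type', 'text/plain');
  xhr.send(ev.data);
});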

how to send really really large json object as response - node.js with express

I have been getting this error, FATAL ERROR: JS Allocation failed - process out of memory, and I have pinpointed the problem: I am sending a really, really large JSON object to res.json (or JSON.stringify).
To give you some context, I am basically sending around 30,000 config files (each around 10,000 lines long) as one JSON object.
My question is: is there a way to send such a huge JSON object, or is there a better way to stream it (for example with socket.io)?
I am using: node v0.10.33, express#4.10.2
UPDATE: Sample code
var app = express();
app.route('/events')
  .get(function (req, res, next) {
    var configdata = [{config: <10,000 lines of config>}, ... 10,000 configs];
    res.json(configdata); // the out-of-memory error comes here
  });
After a lot of trying, I finally decided to go with socket.io and send one config file at a time rather than all config files at once. This solved the out-of-memory crash on my server. Thanks for all your help.
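For reference, a minimal sketch of that approach (the event names and the loadConfigs() helper are made up):

var io = require('socket.io')(server);

io.on('connection', function (socket) {
  // Emit one config per event instead of one enormous payload.
  loadConfigs().forEach(function (config) {
    socket.emit('config', config);
  });
  socket.emit('configs:done'); // tell the client the stream is complete
});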
Try to use streams. What you need is a readable stream that produces data on demand. I'll write simplified code here:
var Readable = require('stream').Readable;
var rs = new Readable();
var sent = false;
rs._read = function () {
  if (sent) return rs.push(null); // push(null) signals end-of-stream
  sent = true;
  // assuming 10,000 lines of config fits in memory; push a string chunk,
  // since the default (non-object) stream mode expects strings or Buffers
  rs.push(JSON.stringify({config: <10,000 lines of config>}));
};
rs.pipe(res);
You can try increasing the memory node has available with the --max_old_space_size flag on the command line.
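For example (the 4096 MB value and the server.js entry point are arbitrary):

node --max_old_space_size=4096 server.js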
There may be a more elegant solution. My first reaction was to suggest using res.json() with a Buffer object rather than trying to send the entire object in one shot, but then I realized that whatever converts the object to JSON will probably want the entire object in memory at once anyway. So you would run out of memory even after switching to a stream. Or at least that's what I would expect.
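One way around that, as a sketch: assuming the configs are already in memory as an array (here called configs, my name), you can stringify one element at a time and write the array brackets yourself, so no single string ever holds the whole payload:

app.get('/events', function (req, res) {
  res.type('json');
  res.write('[');
  configs.forEach(function (config, i) {
    if (i > 0) res.write(',');
    res.write(JSON.stringify(config)); // only one config is stringified at a time
  });
  res.end(']');
});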

how to pass data from python to an HTML page

I am currently working on a program in Python which keeps track of a lot of builds.
The progress is reported in an HTML file, which is currently deleted and rewritten with the new data every time I update it. I know this is not the ideal approach, but I have never really worked with HTML pages to any degree whatsoever.
I would like to do it the right way, but every time I search I only find how to pass data from an HTML page into a program. So what I am asking for is some guidelines as to what I need to look at to achieve this, and any suggestions on the right approach. I just need to pass data from my running Python program to an HTML page.
And if I'm lucky, maybe someone has a very simple example, like a Python constant passed to a text box or whatever on an HTML page.
Regards
Ephreal
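A minimal sketch of the approach described above (rewriting the HTML file from Python; the file name, template, and $status placeholder are all made up):

#!/usr/bin/python3
from string import Template

# A tiny page template; $status is filled in on every rewrite.
PAGE = Template("""<html>
<body>
<h1>Build status</h1>
<pre>$status</pre>
</body>
</html>""")

def write_status(status, dest="status.html"):
    # Regenerate the whole page with the current data.
    with open(dest, "w") as f:
        f.write(PAGE.substitute(status=status))

write_status("42 builds passing")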
#!/usr/bin/python3
__author__ = 'Aidan'

from urllib import request

goog_url = 'http://real-chart.finance.yahoo.com/table.csv?s=GOOG&d=8&e=2&f=2014&g=d&a=2&b=27&c=2014&ignore=.csv'

def dl(csv_url, dest):
    # Read the response and decode the bytes to text before splitting,
    # instead of str()-ing the bytes and splitting on a literal "\n".
    response = request.urlopen(csv_url)
    csv_str = response.read().decode('utf-8')
    with open(dest, "w") as fx:
        for line in csv_str.splitlines():
            fx.write(line + "\n")

dl(goog_url, 'a.csv')
This will take a file from the web and save it to a filename of your choice.

Consuming SSIS Data Profile XML

I am attempting to read the output of an SSIS Data Profile task into an MVC app. To work out the kinks, I wrote a small console app to test the parsing of the xml file.
I used the following link:
http://schemas.microsoft.com/sqlserver/2008/DataDebugger/DataProfile.xsd
to download the .XSD file that should describe the .XML file that was generated in the Data Profile output file.
I then ran xsd.exe to create a C# class to include in my console app.
Following is my very simple test code:
XmlSerializer xser = new XmlSerializer(typeof(DataProfile));
DataProfile dProf;
// Note: the path needs a verbatim string literal (@), not #.
using (var fs = new FileStream(@"D:\InputFiles\ProfilerDataCVD.XML", FileMode.Open))
{
    dProf = xser.Deserialize(fs) as DataProfile;
}
if (dProf != null)
{
    var profs = dProf.DataProfileOutput.Profiles;
    foreach (ColumnValueDistributionProfileType c in profs)
    {
        Console.WriteLine(string.Format(
            "Column Name: {0}, RowCount: {1}, Distinct Values: {2}",
            c.Column.Name, c.Table.RowCount, c.NumberOfDistinctValues));
    }
}
In that code, "dProf" is never NULL, but always empty. Any assistance at getting data in dProf would possibly save a life, because I am about to jump off of a cliff trying to figure this out!
If there is some obvious XML thing that I am missing, I will be the first to admit that this is not my strongest suit. Feel free to chastise me at will as long as you tell me how to make this return data.
Regrettably, no one has been able to answer this question. And I would still really like to understand why something so simple does not work.
In the meantime, anyone else struggling with the same issue should check out the following link on MSDN forums for an alternative way of doing the same thing.
http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/a282bb60-c099-4656-bf71-52ddc6153c28
I implemented it yesterday in just a few minutes and it works great.
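The gist of that alternative, as a sketch: load the XML directly and query it with XPath plus a namespace manager. (The namespace URI matches the XSD link above; the element names are my assumptions about the profile output, so verify them against your file.)

using System;
using System.Xml;

class ProfileReader
{
    static void Main()
    {
        var doc = new XmlDocument();
        doc.Load(@"D:\InputFiles\ProfilerDataCVD.XML");

        // The profile XML lives in a default namespace, which is the usual
        // reason XmlSerializer hands back empty objects; XPath needs a prefix.
        var ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("dp", "http://schemas.microsoft.com/sqlserver/2008/DataDebugger/");

        foreach (XmlNode node in doc.SelectNodes("//dp:ColumnValueDistributionProfile", ns))
        {
            XmlNode distinct = node.SelectSingleNode("dp:NumberOfDistinctValues", ns);
            Console.WriteLine(distinct != null ? distinct.InnerText : "(not found)");
        }
    }
}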

Perl HTML file upload issue. File has zero size

I have a perl CGI script, that works, to upload a file from a PC to a Linux server.
It works exactly as intended when I write the call to the CGI in my own HTML form and then execute, but when I put the same call into an existing application, the file is created on the server, but does not get the data, it is size zero.
I have compared environment variables (those I can extract from %ENV) and nothing there looks like a cause. I actually tried changing several of the ENV in my own HTML script, to the values the existing application was using, and this did not reveal the problem.
Nothing in the log gives me a clue, the upload operation thinks it was successful.
The user is the same for both tests. If permissions were an issue, then the file would not even be created on the server.
Results are the same in IE as in Chrome (works from my own HTML script, not from within the application).
What specific set up should I be looking at, to compare?
This is the upload code:
if (open(UPLOADFILE, ">$upload_dir/$fname")) {
    binmode UPLOADFILE;
    while (<$from_fh>) {
        print UPLOADFILE;
    }
    close UPLOADFILE;
    $out_msg = "Done with Upload: upload_dir=$upload_dir fname=$fname";
}
else {
    $out_msg = "ERROR opening for upload: upload_dir=$upload_dir filename=$filename";
}
I did verify that:
It does NOT enter the while loop when running from inside the application.
It does enter the while loop when called from my own HTML script.
The value of $from_fh is the same for both runs.
All values used in the code block above are exactly the same for both runs.
You could check the error result of your open:
my $err;
open(my $uploadfile, ">", "$upload_dir/$fname") or $err = $!;
if (!$uploadfile) {
    my $out_msg = "ERROR opening for upload: upload_dir=$upload_dir filename=$filename: $err";
}
else {
    ### Stuff
    ...;
}
My guess, based on the fact that you are embedding it in another application, is that all the input has already been read by some functionality that is part of the other application. For example, if I tried to use this program as part of a CGI script and had already used the param() function from CGI.pm, the entire file upload would have been read already. So if my own code tried to read the file again, it would receive zero data, because the data had already been consumed.
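A sketch of the corresponding fix with CGI.pm (the 'file' field name is made up): take the upload filehandle from the same CGI object that parsed the request, rather than reading STDIN a second time.

use strict;
use warnings;
use CGI;

my $cgi = CGI->new;                  # parses the POST body, including the upload
my $from_fh = $cgi->upload('file');  # hypothetical form field name

# Reading STDIN again at this point would yield nothing - the body
# has already been consumed by CGI.pm's parser.
while (my $line = <$from_fh>) {
    print $line;
}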