Can we open a gz file with Tcl_FSOpenFileChannel? - tcl

Can we open a gz file with the Tcl_FSOpenFileChannel API? https://linux.die.net/man/3/tcl_fsopenfilechannel

You can open the file and see the compressed data within it.
Decompression is done by stacking a decompressing transform on the channel; for GZ-format data that is zlib push gunzip. The API for attaching one to a Tcl channel is currently a Tcl script-level API only, so from C you have to register the channel (Tcl_RegisterChannel) so that Tcl scripts in that interpreter can see it, and then evaluate zlib push gunzip to stack the decompressor on it.
Tcl_Channel chan = Tcl_FSOpenFileChannel(interp, theFileNameObj, "rb", 0);
// Should check for error here (NULL == chan), of course
Tcl_RegisterChannel(interp, chan);
char buffer[128]; // plenty of space; channel names aren't *that* long
snprintf(buffer, sizeof(buffer), "zlib push gunzip %s", Tcl_GetChannelName(chan));
Tcl_Eval(interp, buffer);
// Ought to check result of Tcl_Eval for TCL_ERROR
// Use the file here
Tcl_UnregisterChannel(interp, chan); // registered channels are released this way, not with Tcl_Close
You can use the Tcl zlib support library functions to do decompression (there's both a bulk and a streaming API described on that page), but attaching to a channel isn't one of the options. (I've added a ticket to remind someone to make this nicer.)
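For completeness, here is a rough, untested sketch of the bulk C-level route (the helper name ReadGzFile is mine, and check the Tcl_ZlibInflate man page for the exact meaning of the bufferSize argument; error handling is abbreviated):
#include <tcl.h>

/* Read a whole .gz file and return its decompressed contents, or NULL on error. */
static Tcl_Obj *ReadGzFile(Tcl_Interp *interp, Tcl_Obj *fileNameObj)
{
    Tcl_Channel chan = Tcl_FSOpenFileChannel(interp, fileNameObj, "rb", 0);
    if (chan == NULL) {
        return NULL;
    }
    Tcl_Obj *compressed = Tcl_NewObj();
    Tcl_IncrRefCount(compressed);
    /* -1 = read until EOF, 0 = replace rather than append */
    if (Tcl_ReadChars(chan, compressed, -1, 0) == -1) {
        Tcl_DecrRefCount(compressed);
        Tcl_Close(interp, chan);   /* not registered, so Tcl_Close is fine here */
        return NULL;
    }
    Tcl_Close(interp, chan);

    /* Bulk inflate of GZ-format data; 0 lets Tcl pick a default output buffer size. */
    Tcl_Obj *plain = Tcl_ZlibInflate(interp, TCL_ZLIB_FORMAT_GZIP, compressed, 0, NULL);
    Tcl_DecrRefCount(compressed);
    return plain;   /* NULL on error, with a message left in the interpreter */
}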

Related

Memory issues in a Cloud Storage Function

I have deployed a storage-triggered Cloud Function that needs more memory. I deployed the GCF in the following manner, with the appropriate flags:
gcloud functions deploy GCF_name --runtime python37 --trigger-resource bucket_name --trigger-event google.storage.object.finalize --timeout 540s --memory 8192MB
But I observed in the Google Cloud console that the memory utilization graph is not going beyond 2GB, and in the logs I am getting this error, Function execution took 34566 ms, finished with status: 'connection error', which happens because of the memory shortage. Can I get some help on this?
Edit:
The application uploads text files containing a certain number of samples to the storage bucket. Each file is read when it is uploaded and its data is appended to a pre-existing file. The total number of samples will be at most 75600002; that's why I need 8GB. It gives the connection error while appending the data to the file.
def write_to_file(filename, data, write_meta=False, metadata=[]):
    file1 = open('/tmp/' + filename, "a+")
    if write_meta:
        file1.write(":".join(metadata))
        file1.write('\n')
    file1.write(",".join(data.astype(str)))
    file1.close()
The memory utilisation map was the same after every upload.
You are writing a file to /tmp, which is an in-memory filesystem, so start by deleting that file when you finish uploading it. In fact, the documentation says:
Files that you write consume memory available to your function, and sometimes persist between invocations. Failing to explicitly delete these files may eventually lead to an out-of-memory error and a subsequent cold start.
Ref : https://cloud.google.com/functions/docs/bestpractices/tips#always_delete_temporary_files
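A minimal sketch of that cleanup (upload_to_bucket is a made-up placeholder for however the aggregated file is persisted, e.g. with google-cloud-storage; the explicit delete is the point):
import os

def finalize(filename):
    path = os.path.join('/tmp', filename)
    try:
        upload_to_bucket(path)   # hypothetical: persist the file outside the instance
    finally:
        if os.path.exists(path):
            os.remove(path)      # frees the in-memory /tmp filesystem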

Streaming a virtual terminal without video

If I want to live-broadcast some work in a text editor embedded in a virtual terminal on my personal computer, I can stream a video of the window containing it on the web.
But since the information consists mainly of a bunch of characters, possibly with some colors and formatting, I think that video is a waste of resources, bandwidth- and technology-wise.
What would you recommend for this, and is there some server implementing the solution somewhere?
The requirements are:
the stream must be almost real time (at least 1 update per second and no more than 1 second delay)
audience can access the stream with only a web browser (no additional software on their side), read-only (no interaction with the stream or with my terminal)
features from, say, xterm or urxvt are supported
all necessary software (both the streaming client side and any potential server side) is open source
Comments on the technical advantages of such a tool compared to video streaming are welcome.
I finally took the time to implement a complete solution, using socket.io within a simple NodeJS server for broadcasting.
On the client side, serve a simple HTML page with an Xterm.js terminal
<script src='/socket.io/socket.io.js'></script>
<script src="xterm/xterm.js"></script>
...
<div class="terminal" id="terminal"></div>
and script the synchronization along the lines of
var term = new Terminal();
var socket = io();
term.open(document.getElementById('terminal'));
socket.on("updateTerminal", function (data) {
  term.write(data);
});
The data passed to term.write of Xterm.js can be raw terminal data. Several UNIX utilities can capture such raw data from a terminal, for instance tmux (as proposed by jerch in the comments) or script.
To pass this data to the server for broadcasting, the easiest way is to use a named pipe; so, on the server side:
mkfifo server_pipe
script -f server_pipe
(the terminal issuing that last command will be the one broadcasting; if one does not have physical access to the server, one can use an additional pipe and a tunneling connection
mkfifo local_pipe
cat local_pipe | ssh <server> 'cat > path/to/server_pipe'&
script -f local_pipe
)
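(Untested variant: if you prefer tmux over script, as mentioned above, tmux's pipe-pane should be able to feed the same pipe from inside an existing session
tmux pipe-pane -o 'cat >> path/to/server_pipe'
)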
Finally, the NodeJS server must listen to the named pipe and broadcast any new data:
/* create server */
const http = require('http');
const server = http.createServer(function (request, response) {
  ...
});

/* open named pipe for reading */
const fs = require('fs');
const fd = fs.openSync("path/to/server_pipe", 'r+');
const termStream = fs.createReadStream(null, {fd});
termStream.setEncoding('utf8');

/* broadcast any new data with socket.io */
const iolib = require("socket.io");
const io = iolib(server);
termStream.on('data', function (data) {
  io.emit("updateTerminal", data);
});
All this mechanism is implemented in my software Remote lecture.
As for the comparison with video broadcast, I did not take the time to actually quantify the difference, but for equivalent resolution and latency, the above mechanism should use much less network and computing resources than capturing a graphic terminal output and sharing it with video.

Importing and parsing calendars in OCaml

I am working on a scheduling/planning program in OCaml and I want to be able to use an iCal file as an input, but I can't figure out how to parse the file into my own calendar type in OCaml. Ideally, I want to be able to read an iCal file in the same way that you can read a json file using Yojson. Any ideas for how I could accomplish this?
If you're talking about the iCalendar format, then there is the OCaml library icalendar that can read it. You can install it with
opam install icalendar
It is pretty undocumented so here is an example program that will read and print back a calendar.
open Format

let read filename =
  let buf = Buffer.create 4096 in
  let src = open_in filename in
  let rec loop () = loop (Buffer.add_channel buf src 4096) in
  try loop () with End_of_file -> close_in src; Buffer.contents buf

let main filename =
  match Icalendar.parse (read filename) with
  | Error failure ->
    eprintf "Failed to read file %s: %s\n%!" filename failure
  | Ok calendar ->
    printf "%a\n%!" Icalendar.pp calendar

let () = main Sys.argv.(1)
Note that I also had to write the read function, which reads the whole file into a string. This function is not a part of the standard library but is commonly provided by other libraries, e.g., Base, Core, Batteries.
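That said, if your compiler is OCaml 4.14 or newer, the standard library's In_channel module now covers this, so a minimal sketch of read could be just:
let read filename =
  In_channel.with_open_bin filename In_channel.input_all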
To build the program, create an empty folder, put the code into a file, e.g., example.ml and then issue the following command in that folder:
ocamlbuild -pkg icalendar example.native
You can then use the built binary as
./example.native input.ics
where input.ics is the sample input.

Extracting the outputs/results from an executed .pexe file

My goal is to convert a C++ program into a .pexe file in order to execute it later on a remote computer. The .pexe file will contain some mathematical formulas or functions to be calculated on a remote computer, so I'll basically be using the computational power of the remote computer. For all this I'll be using the nacl_sdk with the Pepper library, and I would be grateful if someone could clarify some things for me:
Is it possible to save the outputs of the executed .pexe file on the remote computer in to a file, if it’s possible then how? Which file formats are supported?
Is it possible to send the outputs of the executed .pexe file on the remote computer automatically to the host computer, if it’s possible then how?
Do I have to install anything for that to work on the remote computer?
Any suggestion will be appreciated.
From what I've tried, it seems like you can't capture the stuff that your pexe writes to stdout - it just goes to the stdout of the browser (it took me hours to realize that it does go somewhere - I followed a bad tutorial that had me believe the pexe's stdout was going to be posted to the JavaScript side and was wondering why it "did nothing").
I'm currently working on porting my stuff to .pexe as well, and it turned out to be quite simple, but that has to do with the way I write my programs:
I write my (C++) programs such that all code-parts read inputs only from an std::istream object and write their outputs to some std::ostream object. Then I just pass std::cin and std::cout to the top-level call and can use the program interactively in the shell. But then I can easily swap out the top-level call to use an std::ifstream and std::ofstream to use the program for batch-processing (without pipes from cat and redirecting to files, which can be troublesome under some circumstances).
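For illustration, here is a minimal sketch of that pattern, with toplevelCall standing in for the program's real entry point (an assumed name, nothing NaCl-specific):
#include <fstream>
#include <iostream>
#include <string>

// All real work reads from an istream and writes to an ostream.
void toplevelCall(std::istream &in, std::ostream &out) {
    std::string line;
    while (std::getline(in, line)) {
        out << line << '\n';   // the actual processing goes here
    }
}

int main(int argc, char *argv[]) {
    if (argc == 3) {
        std::ifstream in(argv[1]);
        std::ofstream out(argv[2]);
        toplevelCall(in, out);              // batch mode: files
    } else {
        toplevelCall(std::cin, std::cout);  // interactive mode: the shell
    }
}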
Since I write my programs like that, I can just implement the message handler like
class foo : public pp::Instance {
  ... ctor, dtor, ...
  virtual void HandleMessage(const pp::Var& msg) override {
    std::stringstream i, o;
    i << msg.AsString();
    toplevelCall(i, o);
    PostMessage(o.str());
  }
};
so the data I get from the browser is put into a stringstream, which the rest of the code can use for inputs. It gets another stringstream where the rest of the code can write its outputs to. And then I just send that output back to the browser. (Downside is you have to wait for the program to finish before you get to see the result - you could derive a class from ostream and have the << operator post to the browser directly... nacl should come with a class that does that - I don't know if it actually does...)
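If you want to try that, an untested sketch of the do-it-yourself version could look like this (PostMessageBuf is my own name, not something the SDK provides):
#include <sstream>
#include <ostream>
#include "ppapi/cpp/instance.h"
#include "ppapi/cpp/var.h"

// A stringbuf that posts its contents to the browser every time the
// attached ostream is flushed.
class PostMessageBuf : public std::stringbuf {
 public:
  explicit PostMessageBuf(pp::Instance* instance) : instance_(instance) {}
 protected:
  int sync() override {
    instance_->PostMessage(pp::Var(str()));  // ship whatever has accumulated
    str("");                                 // and start over with an empty buffer
    return 0;
  }
 private:
  pp::Instance* instance_;
};

// Inside HandleMessage you could then do:
//   PostMessageBuf buf(this);
//   std::ostream browser_out(&buf);
//   toplevelCall(i, browser_out);
//   browser_out.flush();   // each flush posts the output gathered so far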
On the HTML/JS side, you can then have a textarea and a pre (which I like to call stdin and stdout ;-) ) and a button which posts the content of the textarea to the pexe - and an event handler that writes the messages from the pexe to the pre, like this
<embed id='pnacl' type='application/x-pnacl' src='manifest.nmf' width='0' height='0'/>
<textarea id="stdin">Type your input here...</textarea>
<pre id='stdout' width='80' height='25'></pre>
<script>
var pnacl = document.getElementById('pnacl');
var stdout = document.getElementById('stdout');
var stdin = document.getElementById('stdin');
pnacl.addEventListener('message', function(ev){stdout.textContent += ev.data;});
</script>
<button onclick="pnacl.postMessage(stdin.value);">Submit</button>
Congratulations! Your program now runs in the browser!
I am not through with porting my compilers, but it seems like this would even work for stuff that uses flex & bison (you only have to copy FlexLexer.h to the include directory of the PNaCl SDK and ignore the warnings about the "register" storage class specifier :-)).
Are you using the .pexe in a browser? That's the usual case.
I recommend using nacl_io to emulate POSIX in the browser (also look at file_io). This will allow you to save files locally and retrieve them, in any format you fancy.
To send the output, use the browser's usual capabilities such as XMLHttpRequest. You need PNaCl to talk to JavaScript for this; you may want to look at some of the examples.
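For example (just a sketch - the /results endpoint is made up, and your server has to accept the POST), the message listener from the snippet above could forward each result:
var pnacl = document.getElementById('pnacl');
pnacl.addEventListener('message', function (ev) {
  var xhr = new XMLHttpRequest();   // plain XHR; fetch() would work just as well
  xhr.open('POST', '/results');     // hypothetical endpoint on your web server
  xhr.setRequestHeader('Content-Type', 'text/plain');
  xhr.send(ev.data);                // send the pexe's output back to the host
});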
A regular web server will do, it really depends on what you're doing.

SOLR - Best approach to import 20 million documents from csv file

My current task is to figure out the best approach to load millions of documents into Solr.
The data file is an export from the DB in CSV format.
Currently, I am thinking about splitting the file into smaller files and having a script that posts these smaller ones using curl.
I have noticed that if you post a large amount of data, the request times out most of the time.
I am looking into the Data Import Handler and it seems like a good option.
Any other ideas are highly appreciated.
Thanks
Unless a database is already part of your solution, I wouldn't add that additional complexity. Quoting the SOLR FAQ: it's your servlet container that is issuing the session timeout.
As I see it, you have a couple of options (In my order of preference):
Increase container timeout
Increase the container timeout. ("maxIdleTime" parameter, if you're using the embedded Jetty instance).
I'm assuming you only occasionally index such large files? Increasing the timeout temporarily might just be the simplest option.
Split the file
Here's a simple Unix script that will do the job (splitting the file into 500,000-line chunks):
split -d -l 500000 data.csv split_files.
for file in split_files.*
do
  curl 'http://localhost:8983/solr/update/csv?fieldnames=id,name,category&commit=true' -H 'Content-type:text/plain; charset=utf-8' --data-binary @"$file"
done
Parse the file and load in chunks
The following groovy script uses opencsv and solrj to parse the CSV file and commit changes to Solr every 500,000 lines.
import au.com.bytecode.opencsv.CSVReader
import org.apache.solr.client.solrj.SolrServer
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer
import org.apache.solr.common.SolrInputDocument

@Grapes([
    @Grab(group='net.sf.opencsv', module='opencsv', version='2.3'),
    @Grab(group='org.apache.solr', module='solr-solrj', version='3.5.0'),
    @Grab(group='ch.qos.logback', module='logback-classic', version='1.0.0')
])
SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr/");

new File("data.csv").withReader { reader ->
    CSVReader csv = new CSVReader(reader)
    String[] result
    Integer count = 1
    Integer chunkSize = 500000
    while ((result = csv.readNext()) != null) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", result[0])
        doc.addField("name_s", result[1])
        doc.addField("category_s", result[2])
        server.add(doc)
        if (count.mod(chunkSize) == 0) {
            server.commit()
        }
        count++
    }
    server.commit()
}
In SOLR 4.0 (currently in BETA), CSVs from a local directory can be imported directly using the UpdateHandler. Modifying the example from the SOLR Wiki:
curl "http://localhost:8983/solr/update?stream.file=exampledocs/books.csv&stream.contentType=text/csv;charset=utf-8"
And this streams the file from the local location, so no need to chunk it up and POST it via HTTP.
The above answers have explained the ingestion strategies from a single machine really well.
Here are a few more options if you have a big-data infrastructure in place and want to implement a distributed data ingestion pipeline:
Use Sqoop to bring the data into Hadoop, or place your CSV file in Hadoop manually.
Use one of the below connectors to ingest the data:
the Hive-Solr connector or the Spark-Solr connector.
PS:
Make sure no firewall blocks connectivity between the client nodes and the Solr/SolrCloud nodes.
Choose the right directory factory for data ingestion; if near-real-time search is not required, use StandardDirectoryFactory.
If you get the below exception in the client logs during ingestion, tune the autoCommit and autoSoftCommit configuration in the solrconfig.xml file:
SolrServerException: No live SolrServers available to handle this request
Definitely just load these into a normal database first. There are all sorts of tools for dealing with CSVs (for example, Postgres's COPY), so it should be easy. Using the Data Import Handler is also pretty simple, so this seems like the most friction-free way to load your data. This method will also be faster since you won't have unnecessary network/HTTP overhead.
The reference guide says ConcurrentUpdateSolrServer could/should be used for bulk updates.
Javadocs are somewhat incorrect (v 3.6.2, v 4.7.0):
ConcurrentUpdateSolrServer buffers all added documents and writes them into open HTTP connections.
It doesn't buffer indefinitely, but up to int queueSize, which is a constructor parameter.
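For illustration, a minimal SolrJ 4.x sketch (the URL, queue size, and thread count are placeholder values):
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkLoad {
    public static void main(String[] args) throws Exception {
        // queueSize = 10000 buffered documents, drained by 4 background threads
        ConcurrentUpdateSolrServer server =
                new ConcurrentUpdateSolrServer("http://localhost:8983/solr", 10000, 4);
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        server.add(doc);              // returns quickly; requests go out in the background
        server.blockUntilFinished();  // wait until the buffered adds have been sent
        server.commit();
        server.shutdown();
    }
}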