Why can't I read a JSON file with a different program while my Processing sketch is still open? - json

I'm writing data to a JSON file in Processing with the saveJSONObject command. I would like to access that JSON file with another program (MAX/MSP) while my sketch is still open. The problem is, MAX is unable to read from the file while my sketch is running. Only after I close the sketch is MAX able to import data from my file.
Is Processing keeping that file open somehow while the sketch is running? Is there any way I can get around this problem?

It might be easier to stream your data straight to MaxMSP using the OSC protocol. On the Processing side, have a look at the oscP5 library and on the Max side at the udpreceive object.
You could send your JSON object as a string and unpack that in Max (maybe using the JavaScript support already present in Max), but it might be simpler to mirror the structure of your JSON object in the arguments of the OSC message, which you can then unpack in Max directly.
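If it helps to see the shape of it, here is the same idea sketched in Python with the third-party python-osc package rather than oscP5 (the port, address pattern, and values are made up); on the Max side a [udpreceive 7400] would then hand the flat argument list to your patch:
from pythonosc.udp_client import SimpleUDPClient

# Sketch only: instead of writing a JSON file, flatten the values you would have
# put into the JSON object into a single OSC message. Assumes python-osc is
# installed and Max has a [udpreceive 7400] listening on the same machine.
client = SimpleUDPClient("127.0.0.1", 7400)           # Max host and port (arbitrary)
client.send_message("/sensor", [42, 3.14, "label"])   # OSC address + flat arguments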

Probably yes: I/O is usually buffered (notably for performance reasons, and also because the hardware does I/O in blocks).
Try to flush the output channel, perhaps using PrintWriter::flush or something similar.
Details are implementation specific (and might be operating system specific).
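As a rough sketch of that idea (in Python rather than Processing, purely to illustrate the flush-after-write pattern; the file name and payload are placeholders):
import json
import os

data = {"sensor": 42}  # placeholder payload
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(data, f)
    f.flush()             # empty the in-process buffer into the OS
    os.fsync(f.fileno())  # ask the OS to commit it to disk so other programs see it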

How to use pyarrow record batch to communicate across different processes or even different hosts

I read about recordBatch in pyarrow and am very interested. (I am a novice in the pyarrow world.) I am wondering if I could use it to communicate between two different processes or different hosts. I am confused because all the examples and documentation I could find online cover use cases within the same process, e.g. https://arrow.apache.org/docs/python/ipc.html and https://wesmckinney.com/blog/arrow-streaming-columnar/. Basically, they are doing this:
sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, batch.schema)
writer.write_batch(batch)
writer.close()
source = sink.getvalue()
reader = pa.ipc.open_stream(source)
# the reader does its job...
Two questions:
If I have two different processes or two different hosts, one that writes and another that reads, how do I pass the "source" or some kind of handle? I tried with a vanilla file but it doesn't seem to work. Maybe I am doing something wrong. For example,
writer = pa.ipc.new_file('tmp.record_batch', batch.schema)
writer.write_batch(batch)
# works so far, but next:
reader = pa.ipc.open_file('tmp.record_batch')
ArrowInvalid: Not an Arrow file
Is there a SWMR (single-write-multiple-non-blocking-reader) mode?
Thanks in advance!
The examples are demonstrating in-process for simplicity, but they are working with the Arrow IPC format, which you can send between processes or machines. Think of this in terms of a Parquet, Avro, or CSV file - you can write this out in one process, then read it in another process. It's up to you how these processes communicate otherwise.
For your attempt, you need to close the file to write out the footer before you can open it in another process. You could also try the stream format which doesn't depend on a footer and should be readable incrementally (assuming you flush the file each time). This may even work over something like a domain socket depending on what you're trying to accomplish.
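Concretely, the snippet from the question works once the writer is closed so the footer is written; a minimal sketch of the round trip (the file name and one-column batch are placeholders, and the two halves would normally run in separate processes):
import pyarrow as pa

batch = pa.record_batch([pa.array([1, 2, 3])], names=["x"])

# writer process: closing the writer (here via the context manager) writes the footer
with pa.ipc.new_file("tmp.record_batch", batch.schema) as writer:
    writer.write_batch(batch)

# reader process: run after the writer has closed the file
reader = pa.ipc.open_file("tmp.record_batch")
print(reader.read_all())
The stream format (pa.ipc.new_stream / pa.ipc.open_stream) can be swapped in the same way when you want the reader to consume batches incrementally instead of waiting for a footer.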
However, it sounds like you're interested in something more like a client-server setup? In that case, Arrow Flight may be more appropriate, especially once you have two different hosts. Flight is an RPC framework specialized for Arrow data transfer.
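A rough sketch of what that could look like with pyarrow.flight (the port, ticket handling, and payload are all made up; a real service would use the ticket to decide what to return):
import sys
import pyarrow as pa
import pyarrow.flight as flight

LOCATION = "grpc://0.0.0.0:8815"  # arbitrary port for this sketch

class BatchServer(flight.FlightServerBase):
    def do_get(self, context, ticket):
        # Serve a fixed table regardless of the ticket; real code would look up
        # the requested dataset based on the ticket contents.
        table = pa.table({"x": [1, 2, 3]})
        return flight.RecordBatchStream(table)

if __name__ == "__main__":
    if "--serve" in sys.argv:      # run this on host/process 1
        BatchServer(location=LOCATION).serve()
    else:                          # run this on host/process 2
        client = flight.connect("grpc://localhost:8815")
        print(client.do_get(flight.Ticket(b"anything")).read_all())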

Need help downloading and reading a zipped CSV file in memory with Clojure

I have an external site from which I want to download a zipped CSV file. Currently, I'm downloading the zipped file, saving it to disk, unzipping it, saving the unzipped file to disk, and then reading the unzipped file with the CSV reader. There are a lot of unnecessary steps in that process that can be trimmed out, so I set out to do so.
This amazing answer helped me to get myself going. I tried to use the first option linked there (GZIPInputStream), but I get a "Not GZIP format" error, so I suppose I have to go to the second option.
This is my current code, and it does what I want it to do:
(defn download-zipped-stream! []
  (:body (clj-http.client/get "www.example.com" {:as :stream})))

(with-open [stream (java.util.zip.ZipInputStream. (download-zipped-stream!))]
  (.getNextEntry stream) ; position the stream at the first entry in the zip archive
  (doall (clojure.data.csv/read-csv (clojure.java.io/reader stream) :separator \;)))
I literally got to this by trial and error. There are mainly three things I'd like to change / understand about this code.
Ideally, I would want to break my code into two parts: one to download and unzip the content, returning a stream - the reason being that I want to decide later whether to read it as a CSV directly or write it to disk (I don't want to lose this option because, during development, it is much easier to read a pre-downloaded CSV file than to download the big content every single time). It turns out that if I try to access the stream outside of the with-open call, I get a "stream closed" error (which, from what I understand, makes total sense).
In the above code, I have to call this .getNextEntry, or I get an empty list. As someone striving to write functional code, this bothers me because, from what I can understand, I'm dealing with state here - my stream object looks mutable, which is something I really don't want. Isn't there a way to work around this step and simply not have it there?
I tried to call the read-csv function directly on the stream object, but read-csv apparently doesn't know how to handle ZipInputStreams. Seeing this, I simply (and hopefully) threw an io/reader call in between, and it worked. I don't know if this is the best approach, though. Is it correct?
I'm quite new to Clojure, and I'm completely clueless about Java in general, so, as you can see, my knowledge of these stream objects is pretty limited. I tried to read up on them in Java, but I quit because I wasn't sure how much of it would be useful for someone learning Clojure, so any pointers are also appreciated.
I think you are on the right approach. Suggestions to consider:
Consider using wget to manually download the *.csv.gz file to your local disk. Then, just open that local file instead of using clj-http.client/get.
I haven't played much with ZipInputStream, but if using .getNextEntry() seems to be required, just go with it.
The examples for read-csv show using a Reader to give access to the input file, so this is the expected behavior.
This template project shows how I like to organize a Clojure project & source code. Be sure to peruse the list of documentation provided.
Don't forget to utilize cljdoc.org for looking up Clojure library API docs. For example, see the API docs for data.csv.
Update
You may also want to review this answer.
Use https://github.com/techascent/tech.ml.dataset optionally with https://scicloj.github.io/tablecloth/index.html (a dplyr-like API for TMD).
It also has the advantage of being extremely fast and able to handle datasets that can't fit in memory, and it talks SQL, Arrow, et al. Join the conversation about it here:
https://clojurians.zulipchat.com/#narrow/stream/151924-data-science/topic/tech.2Eml.2Edataset

Parsing Protocol-Buffers without .proto file

I am reverse-engineering an Android app as part of a security project. My first step is to discover the protocol exchanged between the app and the server. I have found that the protocol being used is Protocol Buffers. Given the nature of protobuf, the original .proto file is needed to deserialize a protobuf-encoded message. Since I don't have it, I used protod to disassemble the Android app and recover any .proto files used.
I have the Android app in a form where it is a bunch of .smali and .so files. Running protod against the .so files yields only one .proto file -- google/protobuf/descriptor.proto.
I was under the impression that users of protocol buffers write their own .proto files, which might reference google/protobuf/descriptor.proto, but according to protod, google/protobuf/descriptor.proto is the only proto file used by the app. Could this actually be the case, and is google/protobuf/descriptor.proto enough for me to deserialize the messages between the app and the server?
When you write a .proto file you can set an option optimize_for to LITE_RUNTIME (see here) and this will omit the descriptors from the generated code to reduce the size of the binary. I believe this is a common practice for mobile development since code size is a scarce resource in that environment. This may explain why you found only a single .proto file. It is unlikely that the app is actually transferring any data using descriptor.proto since that is mostly an implementation detail of the protocol buffers library.
If you cannot find any other descriptors, your best bet might be to try to interpret the protocol buffers without them. You can read about the protocol buffers wire format here. An easy way to get started would be to create a proto2 message type containing no fields and attempt to parse the data as that type. You can then use the reflection API to examine what are known as the "unknown fields" in the message and try to figure out what they represent.
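If you end up without any descriptors at all, you can also walk the wire format by hand instead of going through the reflection API; a rough sketch in Python (the payload file name is hypothetical, and the deprecated group wire types are not handled):
import struct

def read_varint(buf, pos):
    # Protobuf varints: 7 payload bits per byte, high bit set means "more bytes follow".
    result, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7

def scan_fields(buf):
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_no, wire_type = key >> 3, key & 0x7
        if wire_type == 0:    # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:  # 64-bit (fixed64/double), shown here as a raw integer
            value = struct.unpack_from("<q", buf, pos)[0]
            pos += 8
        elif wire_type == 2:  # length-delimited (string, bytes, or nested message)
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        elif wire_type == 5:  # 32-bit (fixed32/float), shown here as a raw integer
            value = struct.unpack_from("<i", buf, pos)[0]
            pos += 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        print(field_no, wire_type, value)

scan_fields(open("payload.bin", "rb").read())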

How to convert a large JSON file into XML?

I have a large JSON file; its size is 5.09 GB. I want to convert it to an XML file. I tried online converters, but the file is too large for them. Does anyone know how to do that?
The typical way to process XML as well as JSON files is to load them completely into memory. You then have a so-called DOM, which allows various kinds of data processing. But neither XML nor JSON is really designed for storing as much data as you have here. In my experience you will typically run into memory problems as soon as you exceed a limit of about 200 MByte. This is because the DOM is composed of individual objects, an approach that results in a huge memory overhead, far exceeding the amount of data you actually want to process.
The only way for you to process files like that is to take a streaming approach. The basic idea: instead of parsing the whole file and loading it into memory, you parse and process the file "on the fly". As data is read, it is parsed and events are triggered, to which your software can react and perform actions as needed. (For details, have a look at the SAX API to understand this concept in more depth.)
As you stated, you are processing JSON, not XML. Streaming APIs for JSON are available in the wild as well. In any case you could implement one fairly easily yourself: JSON is a pretty simple data format.
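For example, if the third-party ijson package is an option and the file is (say) one huge top-level JSON array of flat objects whose keys are valid XML element names, a streaming conversion could look roughly like this (file names and element names are made up):
import ijson
from xml.sax.saxutils import escape

with open("big.json", "rb") as src, open("big.xml", "w", encoding="utf-8") as dst:
    dst.write("<records>\n")
    for record in ijson.items(src, "item"):  # stream one array element at a time
        dst.write("  <record>")
        for key, value in record.items():
            dst.write(f"<{key}>{escape(str(value))}</{key}>")
        dst.write("</record>\n")
    dst.write("</records>\n")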
Nevertheless, such an approach is not optimal: it typically results in very slow data processing because of the millions of method invocations involved. For every item encountered you need to call a method to perform some data processing task, and this, together with the additional checks about what kind of information you have just encountered in the stream, slows processing down considerably.
You should really consider using a different kind of approach: first split your file into many small ones, then perform the processing on them. This might not seem very elegant, but it keeps your task much simpler, and it gives you a major advantage: it will be much easier to debug your software. Unfortunately you are not very specific about your problem, so I can only guess, but large files typically imply a pretty complex data model. Therefore you will probably be much better off with many small files instead of a single huge one. Later it also lets you dig into individual aspects of your data and your processing as needed. You will probably fail to get any detailed insight of that kind while working with a single 5 GByte file, and when errors occur you will have trouble identifying which part of the huge file is causing the problem.
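A rough sketch of that splitting step, under the same top-level-array and ijson assumptions as above (chunk size and file names are arbitrary):
import json
import ijson

CHUNK = 100_000
with open("big.json", "rb") as src:
    buf, part = [], 0
    for record in ijson.items(src, "item"):
        buf.append(record)
        if len(buf) == CHUNK:
            with open(f"part-{part:05d}.json", "w", encoding="utf-8") as out:
                json.dump(buf, out, default=float)  # default=float handles the Decimal values ijson produces
            buf, part = [], part + 1
    if buf:  # write the remainder
        with open(f"part-{part:05d}.json", "w", encoding="utf-8") as out:
            json.dump(buf, out, default=float)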
As I already said, you are unfortunately not very specific about your problem. Sorry, but without more details (about your data in particular) I can only give you these general recommendations about data processing; I cannot tell you which approach will work best in your case.

Perfmon .blg file specification / parsing library

Where can I find a detailed, low-level spec for the Perfmon binary .blg file format? Or even better, has anyone written a low level, open source library (preferably in C, but any language would do) for parsing .blg files?
There's a tool called relog that can convert these files to CSV or other formats.
http://blog.bennett-scharf.com/2008/12/17/converting-an-existing-perfmon-blg-file-to-csv/
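For instance, a small Python wrapper around relog could look like this (the paths are placeholders; relog ships with Windows):
import csv
import subprocess

# Convert the binary capture to CSV with relog, overwriting any previous output (-y),
# then read the CSV back with the standard library.
subprocess.run(
    ["relog", r"C:\perflogs\capture.blg", "-f", "CSV", "-o", r"C:\perflogs\capture.csv", "-y"],
    check=True,
)

with open(r"C:\perflogs\capture.csv", newline="") as f:
    for row in csv.reader(f):
        print(row)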
This won't help for looking at historical data, but if you have access to the systems running Perfmon, you may want to look at Logman. With Logman you can set up performance counters AND specify the output format, so you can just choose a format that is easy to parse. See the -f option:
-f { bin | bincirc | csv | tsv | SQL } : Specifies the file format used for collecting performance counter and trace data. You can use binary, circular binary, comma and tab separated, or SQL database formats when collecting performance counters.
As others have said, if you also have historical records to parse, you can use the Relog utility to convert existing .blg files into a more useful format.
Another option is to export the Perfmon Data Collector Set as a template and change the log file format in the XML - look for the LogFileFormat tag and change the value to the format of your preference:
0 = CSV, 1 = TSV, 2 = SQL, 3 = the default binary format.
I was looking for a way to incorporate PerfMon data into a SIEM, and found that getting perfmon to log to a SQL DB (and reading the data from a SQL view, from the SIEM agent) was the best way of doing this.
I can't say much about other products, but in LogRhythm SIEM you need a "UDLA" (universal database log adapter) log source for it - and if you want to parse/contextualise the metadata, you'll need some parsing rules (i.e. regex) for what the query returns.
It's useful to see things like "if there's x number of logon errors, AND Avail MBytes is less than 100, THEN trigger alarm/AIEngine rule 'Insufficient Memory to Process Logons'".
That's a pretty lame example, but you get the idea.
You might also look at other things which have a potentially malicious explanation, and also a benign explanation.
For example - if you see a large number of failed attempts to reset passwords, that would usually indicate malicious behaviour - but not if the perfmon counters are also telling you that the Domain Controller has fewer than 1,000 free system PTEs (admittedly unlikely on a 64-bit OS), or is seeing CPU usage of more than 95%. In that case, it's not necessarily a security issue; it's a load/capacity issue - or something is very wrong with your DC.