Best practices in using pickled files across threads - pickle

I am currently using a threaded application where there is a daemon who is periodically reading and writing to/from pickle files and there are different threads that would try to read the contents of such files.
Are there best practices in terms of how to set it up properly? I have checked that I have aligned 'rb' and 'wb' for reading and writing pickle files but I still occasionally receive 'EOFError: Ran out of input'. I am wondering whether it is because two threads are reading and writing into the same file concurrently. If that is not allowed I would appreciate if someone can guide me the direction how to proceed with this.

Related

Does the IPFS JavaScript client support stream (chunked) uploads?

Is there a way to upload a file to IPFS chunk-by-chunk using the js-ipfs-http-client package? To improve the concurrency model of my application, I’d like to avoid holding complete files in memory and instead just work with the chunks.
In the source, I can see a method called start is not implemented. Is this supposed to enable this behavior in the long run?
I asked this question on discuss.ipfs.io as well a little while back.

How to covert a large JSON file into XML?

I have a large JSON file, its size is 5.09 GB. I want to convert it to an XML file. I tried online converters but the file is too large for them. Does anyone know how to to do that?
The typical way to process XML as well as JSON files is to load these files completely into memory. Then you have a so called DOM which allows you various kinds of data processing. But neither XML nor JSON are really designed for storing that much data you have here. To my experience you typically will run into memory problems as soon as you exceed a 200 MByte limit. This is because DOMs are created that are composed from individual objects. This approach results in a huge memory overhead that far exceeds the amount of data you want to process.
The only way for you to process files like that is basically to take a stream approach. The basic idea: Instead of parsing the whole file and loading it into memory you parse and process the file "on the fly". As data is read it is parsed and events are triggered on which your software can react and perform some actions as needed. (For details on that have a look at the SAX API in order to understand this concept in more detail.)
As you stated you are processing JSON, not XML. Stream APIs for JSON should be available in the wild as wel. Anyway you could implement one fairly easily yourself: JSON is a pretty simple data format.
Nevertheless such an approach is not optimal: Typically such a concept will result in very slow data processing because of millions of method invocations involved: For every item encountered you typically need to call a method in order to perform some data processing task. This together with additional checks about what kind of information you currently have encountered in the stream will slow down data processing pretty much.
You really should consider to use a different kind of approach. First split your file into many small ones, then perform processing on them. This approach might not seem to be very elegant, but it helps to keep your task much simpler. This way you gain a main advantage: It will be much easier for you to debug your software. Unfortunately you are not very specific about your problem, so I can only guess, but large files typically imply that the data model is pretty complex. Therefor you will probably be much better off by having many small files instead of a single huge one. And later it allows you to dig into individual aspects of your data and the data processing process as needed. You will probably fail getting any detailed insights into that while having a single large file of 5 GByte to process. On errors you will have trouble to identify which part of the huge file is causing the problem.
As I already stated you unfortunately are very unspecific about your problem. Sorry, but because of having no more details about your problem (and your data in particular) I can only give you these general recommendations about data processing. I do not know any details about your data, so I can not give you any recommendation about which approach will work best in your case.

Hooking my own filesystem functions for RTEMS

I am working on a file system for the RTEMS OS. My question is how can I have the opendir, readdir, and closedir functions found in the unistd.h library redirect to my own opendir, readdir, and closedir functions for my file system? i.e. How can I hook them into RTEMS?
Thanks,
You would have gotten a quicker response from the rtema mailing list. The answer is that each filesystem has to provide a structure with a set of entry points which largely correspond to POSIX file I/O system calls. There are multiple filesystem types already in RTEMS and you should just follow the same pattern.
So look in cpukit/libfs and ask more on rtems.org

What are logging libraries for?

This may be a stupid question, as most of my programming consists of one-man scientific computing research prototypes and developing relatively low-level libraries. I've never programmed in the large in an enterprise environment before. I've always wondered, what are the main things that logging libraries make substantially easier than just using good old fashioned print statements or file output, simple programming logic and a few global variables to determine how verbosely things get logged? How do you know when a few print statements or some basic file output ain't gonna cut it and you need a real logging library?
Logging helps debug problems especially when you move to production and problems occur on people's machines you can't control. Best laid plans never survive contact with the enemy, and logging helps you track how that battle went when faced with real world data.
Off the shel logging libraries are easy to plug in and play in less than 5 minutes.
Log libraries allow for various levels of logging per statement (FATAL, ERROR, WARN, INFO, DEBUG, etc).
And you can turn up or down logging to get more of less information at runtime.
Highly threaded systems help sort out what thread was doing what. Log libraries can log information about threads, timestamps, that ordinary print statements can't.
Most allow you to turn on only portions of the logging to get more detail. So one system can log debug information, and another can log only fatal errors.
Logging libraries allow you to configure logging through an external file so it's easy to turn on or off in production without having to recompile, deploy, etc.
3rd party libraries usually log so you can control them just like the other portions of your system.
Most libraries allow you to log portions or all of your statements to one or many files based on criteria. So you can log to both the console AND a log file.
Log libraries allow you to rotate logs so it will keep several log files based on many different criteria. Say after the log gets 20MB rotate to another file, and keep 10 log files around so that log data is always 100MB.
Some log statements can be compiled in or out (language dependent).
Log libraries can be extended to add new features.
You'll want to start using a logging libraries when you start wanting some of these features. If you find yourself changing your program to get some of these features you might want to look into a good log library. They are easy to learn, setup, and use and ubiquitous.
There are used in environments where the requirements for logging may change, but the cost of changing or deploying a new executable are high. (Even when you have the source code, adding a one line logging change to a program can be infeasible because of internal bureaucracy.)
The logging libraries provide a framework that the program will use to emit a wide variety of messages. These can be described by source (e.g. the logger object it is first sent to, often corresponding to the class the event has occurred in), severity, etc.
During runtime the actual delivery of the messaages is controlled using an "easily" edited config file. For normal situations most messages may be obscured altogether. But if the situation changes, it is a simpler fix to enable more messages, without needing to deploy a new program.
The above describes the ideal logging framework as I understand the intention; in practice I have used them in Java and Python and in neither case have I found them worth the added complexity. :-(
They're for logging things.
Or more seriously, for saving you having to write it yourself, giving you flexible options on where logs are store (database, event log, text file, CSV, sent to a remote web service, delivered by pixies on a velvet cushion) and on what is logged at runtime, rather than having to redefine a global variable and then recompile.
If you're only writing for yourself then it's unlikely you need one, and it may introduce an external dependency you don't want, but once your libraries start to be used by others then having a logging framework in place may well help your users, and you, track down problems.
I know that a logging library is useful when I have more than one subsystem with "verbose logging," but where I only want to see that verbose data from one of them.
Certainly this can be achieved by having a global log level per subsystem, but for me it's easier to use a "system" of some sort for that.
I generally have a 2D logging environment too; "Info/Warning/Error" (etc) on one axis and "AI/UI/Simulation/Networking" (etc) on the other. With this I can specify the logging level that I care about seeing for each subsystem easily. It's not actually that complicated once it's in place, indeed it's a lot cleaner than having if my_logging_level == DEBUG then print("An error occurred"); Plus, the logging system can stuff file/line info into the messages, and then getting totally fancy you can redirect them to multiple targets pretty easily (file, TTY, debugger, network socket...).

Reverse engineering a custom data file

At my place of work we have a legacy document management system that for various reasons is now unsupported by the developers. I have been asked to look into extracting the documents contained in this system to eventually be imported into a new 3rd party system.
From tracing and process monitoring I have determined that the document images (mainly tiff files) are stored in a number of 1.5GB files. These files seem to be read from a specific offset and then written to a tmp file that is then served via a web app to the client, and then deleted.
I guess I am looking for suggestions as to how I can inspect these large files that contain the tiff images, and eventually extract and write them to individual files.
Are the TIFFs compressed in some way? If not, then your job may be pretty easy: stitch the TIFFs together from the 1.5G files.
Can you see the output of a particular 1.5G file (or series of them)? If so, then you should be able to piece together what the bytes should look like for that TIFF if it were uncompressed.
If the bytes don't appear to be there, then try some standard compressions (zip, tar, etc.) to see if you get a match.
I'd open a file, seek to the required offset, and then stream into a tiff object (ideally one that supports streaming from memory or file). Then you've got it. Poke around at some of the other bits, as there's likely metadata about the document that may be useful to the next system.