How to efficiently save complex Fortran structures into binary files?

So, I need to implement my new project in Fortran 2008. In this project I need to store large structures, and I'd like to store them in binary format. So far I've read here about both unformatted and formatted binary streams, but I'm still a little confused about how to use them to store something more complex than a handful of numbers or strings.
Let's say I'll have my data stored in JSON format:
"net_name": "Test ANN",
"net_type": "Feed-forward",
"train_method": "Back-propagation",
"neurons":{ "1": { "inputs":["2", "3"],
"outputs": ["10", "11"]},
"2": { "inputs":["4", "5"],
"outputs": ["11", "12"]}
}
I know that I can use some third-party modules for working with JSON directly (like this one), but I'd also like to be able to use a binary format, because the files can get really huge.
So, is it possible to store such a structure using Fortran binary streams? Or is there some other way to do it in Fortran? I'm not asking exclusively about unformatted binary files; if there is some other elegant solution, I'll be glad to hear about it. Still, I prefer ones without many external dependencies.

I need to understand JSON without technical jargon

I am currently exploring the uses of JSON. I have read many posts, articles, and YouTube videos, yet I still don't understand the purpose or practicality of JSON (it's been a month). No definition has made the concept click well enough for me to implement it comfortably.
What I understand (brief overall understanding): JSON provides an easier way to format data and send it across networks.
My question: could someone provide me with a comprehensible storyline of JSON in action, as I am struggling to understand its practicality? I hope this question makes sense; if not, I can try to re-word it.
Edit for #Philipp: Yes, I do have experience with reading text-based files in Java (mainly from assignments at uni). No, I do not have experience with competing technologies such as XML or YAML. Intuitively, I think of JSON as something like 'cookies', but that is most likely wrong. I hope this helps, and I look forward to your explanation; maybe it will help me understand it.
JSON is, overly simplified, a standard for how to structure your own file formats. "File" does not necessarily mean a file stored in a filesystem. It can also be an ephemeral file that is created on one computer, sent to a different computer over the network, processed, and then discarded without ever being stored. But thinking of it as a file format makes things easier.
A JSON-based file format encodes a document as a key-value structure. Every value has a key. Every value can be a string, a number, another key-value structure, or a list of any of the above. Here is an example based on the one from the Wikipedia article on JSON:
{
  "firstName": "John",
  "lastName": "Smith",
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    }
  ]
}
This file describes a person who has a first name, a last name, one address consisting of a street address, city, state and postal code, and a list of phone numbers, with each phone number having a type and a number.
OK, but there are certainly other ways to store that kind of information. Ways which might be more concise. So why would you choose to invent a file format based on JSON instead of just starting from scratch?
Library support. There are lots of libraries available for parsing and writing JSON. If you ever wrote a file parsing routine yourself, then you know how much of a PITA those can be. There are a ton of edge-cases you have to keep in mind to prevent your program from crashing or reading garbage data. A JSON library takes care of all of these edge-cases for you. This makes it a lot easier for you to create programs working with JSON data than when you invent your own file format.
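For example, here is a minimal sketch of that convenience (in Python, purely as an illustration; the snippet reuses field names from the example document above):

import json

# The library handles quoting, escaping, nesting and all the other edge-cases
# a hand-written parser would have to cover itself.
text = '{"firstName": "John", "phoneNumbers": [{"type": "home", "number": "212 555-1234"}]}'

person = json.loads(text)                   # parse the text into nested dicts/lists
print(person["phoneNumbers"][0]["number"])  # -> 212 555-1234

print(json.dumps(person, indent=2))         # write it back out, pretty-printed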
Tool support. There are editors available that can edit any form of JSON data in a handy UI. For example, did you notice that Stack Overflow automatically added syntax highlighting to the JSON code above? I didn't do anything to make that happen. Stack Overflow just automatically recognized that it is JSON and colored it accordingly. That would not be possible with a homebrewed file format.
Good compromise between machine-readability and human-readability. The format above is not just easy to read for programs (thanks to the aforementioned library support) but also pretty readable and editable for humans. People can intuitively understand the format and edit it in a text editor without breaking stuff, especially if they have worked with JSON-based file formats before.
Forward- and backward compatibility of file formats. This is something you could technically achieve in your own file format, but JSON makes it a lot easier. Imagine you create version 2.0 of your program, which comes with a version 2.0 of the file format. Your documents now have some additional fields. Handling this in homebrewed text-based formats can be really difficult. But the key-value structure of JSON makes it pretty easy to recognize that certain keys are missing and then replace their values with reasonable defaults. Similarly, the 1.0 version of your program might make limited sense of 2.0 documents by simply ignoring any keys it doesn't understand yet.
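As a hedged illustration of that (Python again, with made-up field names), handling a key that only exists in newer documents can be as simple as supplying a default when you look it up:

import json

old_document = json.loads('{"firstName": "John", "lastName": "Smith"}')

# A key introduced in "version 2.0" of the format may simply not be there yet;
# fall back to a reasonable default instead of failing.
nickname = old_document.get("nickname", "")
print(nickname or "<no nickname stored>")

# Conversely, a 1.0 reader copes with 2.0 documents by only ever asking for
# the keys it understands and ignoring everything else.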
Interoperability with JavaScript. This might be kind of situational, but the reason you see JSON used a lot in the context of web applications is that JSON is actually valid JavaScript. That means that in a browser-based application, converting between JavaScript objects and JSON text is trivial. That makes it a preferred choice for exchanging data between browser-based applications and servers. The result is that you see a lot of JSON in cookies or webservice requests (although neither of these mandates the use of JSON).
JSON (JavaScript Object Notation) is simply a lightweight, semi-structured way of representing a set of data.
One sample storyline:
Let's say you are creating an application that needs to communicate with another application and you want to make it easier for other applications to consume the data your application provides.
There are a lot of ways to do this, but by using JSON you make the process simpler (the applications that consume your data can figure out how to read it on their own, if they want to) AND you cut down on the amount of raw data being passed around.
To answer your question: JSON is very simple and lightweight compared to other communication methods like SOAP or connecting straight to the database where you hold your data.

How should a "project" file be written?

With popular software packages, like Microsoft Word or Photoshop, we often have the option to save our progress as a "project" file and later open that file to continue editing our work. This file typically contains all the options and the progress the user has made (e.g., the essay you typed in Word).
So my question is: if I am writing a similar application that needs its own "project" file, how should I go about it? My application is a scientific application, which means it requires a lot of (multi-dimensional) arrays. I understand there are many ways to do this, but I would like to know the de facto way.
Here are some of the options I have outlined:
XML: Human readable. The size is too big and it's too much work to deal with arrays.
JSON: More popular/modern. Good with arrays.
Protocol Buffers: Created by Google. Probably faster.
Database: Probably not a good fit, since "project" files are most likely "temporary". Also, working with arrays is not very straightforward.
Creating your own binary format: Might be the most difficult solution for an inexperienced programmer like myself.
???
I would like to get some advice from you guys. Thank you :).
(Good question. :) Only some thoughts.) I'd prefer a text format for the main project file. You can diff it, and open, read, and modify it easily. Large ASCII or binary data can be stored as serialized data in external files, or in a database like SQLite, from where it can easily be accessed and processed by the application; the main project file then just holds links to the external data store. My advice for the main project file is a simple XML format that can easily be transformed to JSON. A list of key-value pairs (a dict) is good for a start; a value can be of a basic datatype, or be an array or a dict. A complicated XML tree is not good. The key name can also help to describe and structure the data, so I'd prefer key="rect.4711.pos.x" value="500" over <rect id="4711"><pos><x>500</x>...</pos>.... An important aspect is that the project data is portable and self-contained, and that the user can see the project as a single unit even if it is a directory on the file system; for this purpose, supporting some kind of zipped format of the project data is good.
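As a rough sketch of that split between a small text manifest and bulky external array data (Python with NumPy; all file names and keys are invented for the example):

import json
import os

import numpy as np

# Bulky numeric data goes into external binary files...
os.makedirs("project_data", exist_ok=True)
temperature = np.random.rand(256, 256, 50)
np.save("project_data/temperature.npy", temperature)

# ...while the human-readable project file only stores settings and links to the data.
manifest = {
    "project.name": "Ocean model run 42",
    "solver.tolerance": 1e-6,
    "data.temperature": "project_data/temperature.npy",
}
with open("project.json", "w") as f:
    json.dump(manifest, f, indent=2)

Zipping the manifest together with the project_data directory would then give the user a single portable project file.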

How should I process nested data structures (e.g. JSON, XML, Parquet) with Dask?

We often work with scientific datasets distributed as small (<10G compressed), individual, but complex files (xml/json/parquet). UniProt is one example, and here is a schema for it.
We typically process data like this using Spark, since it is supported well. I wanted to see, though, what might exist for doing this kind of work with the Dask Dataframe or Bag APIs. A few specific questions I had are:
Does anything exist for this other than writing custom python functions for Bag.map or Dataframe/Series.apply?
Given any dataset compatible with Parquet, are there any secondary ecosystems of more generic (possibly JIT compiled) functions for at least doing simple things like querying individual fields along an xml/json path?
Has anybody done work to efficiently infer a nested schema from xml/json? Even if that schema was an object that Dask/Pandas can’t use, simply knowing it would be helpful for figuring out how to write functions for something like Bag.map. I know there are a ton of Python json schema inference libraries, but none of them look to be compiled or otherwise built for performance when applied to thousands or millions of individual json objects.
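For context, the kind of custom-function approach I mean looks roughly like this (a sketch with made-up paths and field names, assuming newline-delimited JSON):

import json
import dask.bag as db

# Each line of each input file is assumed to hold one JSON record.
records = db.read_text("data/*.json").map(json.loads)

# Plain Python functions in Bag.map are the obvious way to navigate the nesting;
# here we pull a single (hypothetical) field out of every record.
accessions = records.map(lambda rec: rec.get("accession"))

print(accessions.take(5))  # compute a small sample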

What is the best file format to parse?

Scenario: I'm working on a rails app that will take data entry in the form of uploaded text-based files. I need to parse these files before importing the data. I can choose the file type uploaded to the app; the software (Microsoft Access) used by those uploading has several export options regarding file type.
While it may be insignificant, I was wondering if there is a specific file type that is most efficiently parsed. This question can be viewed as language-independent, I believe.
(While XML is commonly parsed, it is not a feasible file type for sake of this project.)
If it is something exported by Access, the easiest would be CSV, particularly since Ruby includes a CSV parser in its standard library. You will have to do some work to determine the dialect of the CSV (what it uses as a delimiter, how it handles quotes); I don't know how robust the Ruby parser is with those issues, but you should also have some control over this from Microsoft Access.
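The same idea, sketched in Python rather than Ruby (the file name is made up); a standard-library CSV module can even guess the dialect from a sample of the file:

import csv

with open("export_from_access.csv", newline="") as f:
    sample = f.read(4096)
    dialect = csv.Sniffer().sniff(sample)  # guess delimiter and quoting from a sample
    f.seek(0)
    for row in csv.reader(f, dialect):
        print(row)  # each row comes back as a list of column values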
You might want to take a look at JSON. It's a lightweight format, and in contrast to XML it's really easy and clean to parse without requiring a huge library on the backend.
It can represent types like strings, numbers, associative arrays (objects), and lists of such.
I would suggest n-SV (where n is some character) for data that does not include n. That will make lexing the files a matter of a split.
If you have more flexible data, I would suggest JSON.
If you HAVE to roll your own parser, I would suggest CSV or some other delimiter-separated format.
If you are able to use other libraries, there are plenty of options. JSON looks quite fascinating.

How to analyze binary file?

I have a binary file. I don't know how it's formatted; I only know it comes from Delphi code.
Is there any way to analyze a binary file?
Is there any "pattern" for analyzing and deserializing the binary content of a file with an unknown format?
Try these:
Deserialize the data: analyze how your exe was compiled (try File Analyzer). Try to deserialize the binary data with the language discovered. Then serialize it in an XML format (language-independent) that every programming language can understand.
Analyze the binary data: try saving several versions of the file with small variations and use a diff program, together with a hex editor, to work out the meaning of every bit. Use this in conjunction with binary hacking techniques (like How to Crack a Binary File Format by Frans Faase).
Reverse engineer the application: try recovering code using reverse engineering tools for the programming language used to build the app (found with File Analyzer). Otherwise, use a disassembly tool like IDA Pro.
For my hobby project I had to reverse engineer some old game files. My approaches were:
Have a good hex editor.
Look for readable words in the binary file. Note how they are distributed; if the distance between them is constant, you know it is a listing.
Look for 2-3 consecutive zero bytes. They might indicate an int32 value.
Some dwords might be pointers into the file.
Try to identify recurring patterns in the file.
Seeing lots of bytes in the C0-CF range might indicate RLE-compressed data.
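A tiny helper along those lines (a Python sketch, not part of any of the tools mentioned; the file name is hypothetical) that lists runs of printable ASCII together with their offsets, so you can spot constant spacing:

import re

with open("unknown.bin", "rb") as f:
    data = f.read()

# Runs of 4+ printable ASCII bytes, roughly what the GNU "strings" tool reports,
# plus the file offset of each hit so you can check whether they are evenly spaced.
for match in re.finditer(rb"[\x20-\x7e]{4,}", data):
    print(f"0x{match.start():08x}  {match.group().decode('ascii')}")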
I've developed Hexinator (Windows & Linux) and Synalyze It! (macOS) exactly for this purpose. These applications let you view binary files like any other hex editor, but in addition you can create a "grammar" that captures the specifics of a binary file format. The grammar contains all the building blocks and is used to parse the file automatically.
Thus you can keep the knowledge you gain in the analysis and apply it to multiple files simultaneously. You can also color-code the bits and pieces of file formats for a quick overview in the hex editor.
The parsing results are displayed in a tree view where you can also modify the files easily (applying endianness et cetera).
Reverse engineering a binary file when you have some idea of what it represents is a very time-consuming process. If you have no idea what it is, then it will be even harder.
It is possible, though, but you have to have a pretty good reason for doing so.
The first step would be to open it up in a hex editor of your choice and see if you can find any English text to point you in the direction of what the file is even supposed to represent. From there, Google "reverse engineering binary files"; there are much more knowledgeable people than me who have written guides about it.
The "strings" program from GNU binutils is very useful. It will print the strings of printable characters in a file, quite often giving a clue to what a file contains or a program does.
If the data represents serialized Delphi objects, you should start reading about the Delphi serialization process. If that's the case, I think your best bet would be to load it using Delphi and continue your analysis from the IDE. Some information about Delphi serialization can be found here.
EDIT: If the file does contain serialized Delphi objects, then you should write a small Delphi program that loads it and "converts" the data yourself to something neutral, like XML. If you manage to do this, you should check whether Delphi supports serializing to XML. Then you could access those objects from any language.
The unix "file" command is really useful - I don't know if there is anything like it in windows. You run it like this:
file myfile.ext
And it spits out a text description based on the magic numbers and data contained therein.
It is probably included in Cygwin.
If you have access to the application that creates the file, you can apply changes to the application, then save the file and see the effects (Keep in mind that numbers are probably stored in little endian):
First, create the file repeatedly without changing anything. If the files are not byte-for-byte identical, the current date/time is probably stored in the area where the differences occur.
Maybe you want to repeat that with the software running under different environments, to see whether the OS version etc. is stored, but this is rather unusual.
Next you can try to change single variables and create several files that only differ in the value of this variable. This helps you identify where this variable is stored.
That way you can also exclude variables that are not stored in the file: If you change them, but the files created are identical, they are not stored.
In order to test the hypotheses you worked out with the steps above, edit one of the files and have the application read it.
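A quick way to locate those differences (a Python sketch; the two file names are placeholders) is to compare the saved files byte by byte and print the offsets that changed:

# Compare two saved files that should differ in only one variable.
with open("run_a.bin", "rb") as f:
    a = f.read()
with open("run_b.bin", "rb") as f:
    b = f.read()

if len(a) != len(b):
    print(f"sizes differ: {len(a)} vs {len(b)} bytes")

# Offsets where the common prefix of the two files disagrees; a short, contiguous
# cluster of offsets usually marks where the changed variable lives
# (remember that numbers are probably stored little-endian).
diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
print(diffs[:50])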
If you don't have access to the application itself, I suggest that you forget about it and find another way to solve your problem. There is a very high probability that it will be faster...
If the file command does not give a meaningful answer, you may want to try TrID by Marco Pontello to determine whether your data is stored in a known format.
Get the Delphi application, open it in the IDA Pro freeware version, find where it writes the file, and decode how it writes the file that way. Unless it's plain text.
Do you know the program that uses it? If so, you can hook that program's write-to-file function and get an idea of what data it is writing, how large it is, and where.
More Info: http://www.codeproject.com/KB/DLL/Win32APIHooking_Trouble.aspx
Unlike traditional hex editors which only display the raw hex bytes of a file, 010 Editor can also parse a file into a hierarchical structure using a Binary Template. The results of running a Binary Template are much easier to understand and edit than using just the raw hex bytes.
http://www.sweetscape.com/010editor/
Try to open it in a hex editor and analyse.