Can you recognize this file format as parseable? - binary

I'm a technologist at a news company and inherited systems that never migrated our old stories out of the first CMS iteration. I'd like to do so, but the cost of software support is many thousands of dollars. No matter: they are plaintext-ish and it appears I can extract the contents.
Can anyone recognize this file format as parseable for programmatic extraction?
Here are the original files per #Jongware's help in the comments:
https://github.com/mcnaughton/mystery/blob/master/1?raw=true
https://github.com/mcnaughton/mystery/blob/master/2?raw=true
https://github.com/mcnaughton/mystery/blob/master/3?raw=true
Here are the hex dumps:
https://raw.githubusercontent.com/mcnaughton/mystery/master/hexdumps/1
https://raw.githubusercontent.com/mcnaughton/mystery/master/hexdumps/2
https://raw.githubusercontent.com/mcnaughton/mystery/master/hexdumps/3
Here is an example of a useless ASCII version: https://gist.github.com/buley/3acd74d2bf418520c309

Related

Converting large JSON file to XLS/CSV file (Kickstarter campaigns)

As part of my Master's thesis, I'm trying to run some statistics on which factors affect whether crowdfunding campaigns get funded or not. I've been trying to get data from the largest platform Kickstarter.com. Unfortunately, they have removed all the non-successful campaigns from their website (unless you have the direct link).
Luckily, I'm not the only one looking for this data.
Webrobots.io have a scraper robot which crawls all Kickstarter projects and collects data in JSON format (http://webrobots.io/kickstarter-datasets/).
The latest dataset can be found on:
http://webrobots.io/wp-content/uploads/2015/10/Kickstarter_2015-10-22.json_.zip
However, my programming skills are limited, and I don't know how to convert it into an excel file where I can manipulate the data and run my analysis. I found a few online converters, but the file is far too big for it (approx 300 mb).
Can someone please help me get the file converted?
It will earn you an acknowledgement in my Master's thesis when it gets published :)
Thanks in advance!!!
I guess the answer for this varies massively on a few things.
What subject is the masters covering? (mainly to appease many people who will probably assume you're hoping for people to do your homework for you! This might explain why the thread has been down-voted already)
You mention your programming skills are limited... What programming skills do you have? What language would you be using to achieve this goal? Bear in mind that even with a fully coded solution, if it's not in the language you know, you might not be able to compile it!
What kind of information do you want from the JSON file?
With regards to question 3, I've looked in the JSON file and it contains hierarchical data which is pretty difficult to replicate in a flat file i.e. an Excel or CSV file (I should know, we had to do this a lot in a previous job of mine).
But, I would look at the following plan of action to achieve what you're after:
Use a JSON parser to serialize the data into a class structure (Visual Studio can create the classes for you... See this S/O thread - How to show the "paste Json class" in visual studio 2012 when clicking on Paste Special?)
Once you've got the objects in memory, you can then step through them one by one and pick out the data you want and append them to a comma-separated string (in C# I'd use the StringBuilder) and write the rows of data out to a file on disk.
Once this is complete, you'll have the data you want.
Depending on what data you want from the JSON file, step 2 could be the most difficult part as you'd need to step into the different levels of the data hierarchy.
Hope this points you in the right direction?
You may want to look at this Blog.
http://jdunkerley.co.uk/2015/09/04/downloading-and-parsing-met-office-historic-station-data-with-alteryx/
He uses a process with Alteryx that may line up with what you are trying to do. I am looking to do something similar, but haven't tried it yet. I'll update this answer if I get it to work.

Definition of a text file

I'm not really a professional programmer (I just do some number crunching), I'm just trying to learn more some things about computing.
I'm here to ask for -a reference- for a reading regarding the basic aspects of a 'file'. I'm having difficulty to understand the difference between text files and binary files. With my current understaning an image file is no more 'binary' than a text file. I'd like to understand what makes a file a text file. Is it a special sequence of bits?
Please, I just need a good reading reference (although some clarification would be welcome) and I'm not really trying to make a vague, generic, question.
Preferable, I'd like to be pointed to a technical reading containing definitions such as "a text file is a sequence of bits whose etc..."
Thanks,
Seneika.
BTW: what one finds on Wikipedia, for example, is not what I want.
Edit: horrible grammar mistake corrected...
A text file is a computer file that stores a typed document as a
series of alphanumeric characters, usually without visual formatting
information. The content may be a personal note or list, a journal or
newspaper article, a book, or any other text that can be rendered
accurately in typewritten form. Text files are similar to word
processing files in that the content of both is primarily textual;
they differ in that text files usually do not record information such
as character style and size, pagination, or other details that would
specify the appearance of a finished document. Some computer operating
systems make a basic distinction between a text file, which is
intended to be translated directly into human-readable text, and a
binary file, which is interpreted directly by the computer.
Source : Binary File & Text File
More details on WikiPedia
This should help you in your quest to answer your question :-)
http://www.wisegeek.com/what-is-a-binary-file.htm

Reversing an old file format Inbox X

I’m trying to reverse engineer an old medical imaging format called Stentor for interoperability. It was designed by a company of the same name who was subsequently bought by Phillips. But Phillips has forgotten how to read Stentor files. I have a windows program which exports JPEG from Stentor files but it’s closed source. I’d like to automate this process in order to tackle hundreds of files in this format.
The program is late-1990s Win32 or MFC executeable. It runs next to an ActiveX (.ocx) file which I’ve been able to interop with, but that file doesn’t contain the export method. I'm looking for suggestions on how to dissemble the binary in order to unearth the algorithm used to convert Stentor to JPEG. I looked through the Stentor files in hex editor and didn’t find any evidence of JPEG (although hints on finding that would be appreciated too), so I think that the program has a couple of tricks up its sleeve.
Thanks in advance.
Kyle
Few programmers implement complex routines such as image recoding themselves. Instead they tend to license libraries that do that. A very smart way to start would be searching for text strings and see if you can discover the libraries they use. This will subsequently give you a lot of insight into how the data is encoded.
Another good strategy would be to build a program that simply runs the GUI of your export program by sending mouse and keyboard events directly to it. Let this run a few days to complete your export. Reverse engineering the file format is going to be slow and expensive so for a 1 time gig it's probably not worthwhile.

How to reverse engineer binary file formats for compatibility purposes

I am working of a file preparation software to enable translators work easily and efficiently on a wide range of file formats.
As far as text-based formats (xml, php, resource files,...) are concerned, my small preparation utility works fine, but a major problem for most translators is to handle all kinds of proprietary binary formats (Framemaker, Publisher, Quark...).
These files are rarely requested and need to be opened in expensive applications (few freelance can afford to buy $20,000 worth of software just to handle a few projects per year), and even then it is not convenient to work directly in those applications anyway.
I would like to be able to read these files and extract the text in such a way that it can be translated and then re-imported in the original application with minimal effort, or even better, to recreate a valid native binary file.
Does that sound doable?
Where can I find more information on handling binary file formats and are there useful tools for these kind of jobs (besides regular hex editors)?
Thanks in advance.
Of course reverse engineering is possible, but without format specs it will take a lot of work. I would look at the return on effort regarding supporting these 'rarely requested, very expensive' formats. You may be better off spending that effort improving the core functionality of your app.
Another angle is to contact the companies with these formats, explain your goal, explain that it helps their product, and if they don't see you as competition they might be willing to help.
I know that you want to reverse engineer them - but since these may be propriety file formats you are looking at a very steep curve trying to decode them...
Some (as I have written some propritety formats for interal use before) have specific methods and objects written into them that serve some alternative process than the file contents themselves. Stuff that would prove the new file is illegal.
Just my 2 cents and I am no lawyer =>
Maybe you could pick a cheaper application which has import features for QuarkXPress. For example InDesign should be able to read Quark documents. Then use the importing application to export to whatever format you need - maybe with a help of plug-in.

Tools to help reverse engineer binary file formats

What tools are available to aid in decoding unknown binary data formats?
I know Hex Workshop and 010 Editor both support structures. These are okay to a limited extent for a known fixed format but get difficult to use with anything more complicated, especially for unknown formats. I guess I'm looking at a module for a scripting language or a scriptable GUI tool.
For example, I'd like to be able to find a structure within a block of data from limited known information, perhaps a magic number. Once I've found a structure, then follow known length and offset words to find other structures. Then repeat this recursively and iteratively where it makes sense.
In my dreams, perhaps even automatically identify possible offsets and lengths based on what I've already told the system!
Here are some tips that come to mind:
From my experience, interactive scripting languages (I use Python) can be a great help. You can write a simple framework to deal with binary streams and some simple algorithms. Then you can write scripts that will take your binary and check various things. For example:
Do some statistical analysis on various parts. Random data, for example, will tell you that this part is probably compressed/encrypted. Zeros may mean padding between parts. Scattered zeros may mean integer values or Unicode strings and so on. Try to spot various offsets. Try to convert parts of the binary into 2 or 4 byte integers or into floats, print them and see if they make sence. Write some functions that will search for repeating or very similar parts in the data, this way you can easily spot headers.
Try to find as many strings as possible, try different encodings (c strings, pascal strings, utf8/16, etc.). There are some good tools for that (I think that Hex Workshop has such a tool). Strings can tell you a lot.
Good luck!
For Mac OS X, there's a great tool that's even better than my iBored: Synalyze It!
(http://www.synalysis.net/)
Compared to iBored, it is better suited for non-blocked files, while also giving full control over structures, including scriptability (with Lua). And it visualizes structures better, too.
Tupni; to my knowledge not directly available out of Microsoft Research, but there is a paper about this tool which can be of interest to someone wanting to write a similar program (perhaps open source):
Tupni: Automatic Reverse Engineering of Input Formats (# ACM digital library)
Abstract
Recent work has established the importance of automatic reverse
engineering of protocol or file format specifications. However, the
formats reverse engineered by previous tools have missed important
information that is critical for security applications. In this
paper, we present Tupni, a tool that can reverse engineer an input
format with a rich set of information, including record sequences,
record types, and input constraints. Tupni can generalize the format
specification over multiple inputs. We have implemented a
prototype of Tupni and evaluated it on 10 different formats: five
file formats (WMF, BMP, JPG, PNG and TIF) and five network
protocols (DNS, RPC, TFTP, HTTP and FTP). Tupni identified all
record sequences in the test inputs. We also show that, by aggregating
over multiple WMF files, Tupni can derive a more complete
format specification for WMF. Furthermore, we demonstrate the
utility of Tupni by using the rich information it provides for zeroday
vulnerability signature generation, which was not possible with
previous reverse engineering tools.
My own tool "iBored", which I released just recently, can do parts of this. I wrote the tool to visualize and debug file system formats (UDF, HFS, ISO9660, FAT etc.), and implemented search, copy and later even structure and templates support. The structure support is pretty straight-forward, and the templates are a way to identify structures dynamically.
The entire thing is programmable in a Visual BASIC dialect, allowing you to test values, read specific blocks, and all.
The tool is free, works on all platforms (Win, Mac, Linux), but as it's personal tool which I just released to the public to share it, it's not much documented.
However, if you want to give it a try, and like to give feedback, I might add more useful features.
I'd even open source it, but as it's written in REALbasic, I doubt many people will join such a project.
Link: iBored home page
I still occasionally use an old hex editor called A.X.E., Advanced Hex Editor. It seems to have largely disappeared from the Internet now, though Google should still be able to find it for you. The last version I know of was version 3.4, but I've really only used the free-for-personal-use version 2.1.
Its most interesting feature, and the one I've had the most use for deciphering various game and graphics formats, is its graphical view mode. That basically just shows you the file with each byte turned into a color-coded pixel. And as simple as that sounds, it has made my reverse-engineering attempts a lot easier at times.
I suppose doing it by eye is quite the opposite of doing automatic analysis, though, and the graphical mode won't be much use for finding and following offsets...
The later version has some features that sound like they could fit your needs (scripts, regularity finder, grammar generator), but I have no idea how good they are.
There is Hachoir which is a Python library for parsing any binary format into fields, and then browse the fields. It has lots of parsers for common formats, but you can also write own parsers for your files (eg. when working with code that reads or writes binary files, I usually write a Hachoir parser first to have a debugging aid). Looks like the project is pretty much inactive by now, though.
Kaitai is an open-source language for describing binary structures in data streams. It comes with a translator that can output parsing code for many programming languages, for inclusion in your own program code.
My project icebuddha.com supports this using python to describe the format in the browser.
A cut'n'paste of my answer to a similar question:
One tool is WinOLS, which is designed for interpreting and editing vehicle engine managment computer binary images (mostly the numeric data in their lookup tables). It has support for various endian formats (though not PDP, I think) and viewing data at various widths and offsets, defining array areas (maps) and visualising them in 2D or 3D with all kinds of scaling and offset options. It also has a heuristic/statistical automatic map finder, which might work for you.
It's a commercial tool, but the free demo will let you do everything but save changes to the binary and use engine management features you don't need.