For a project I need to convert a .pdf, .docx or .jpg file into a binary representation consisting of 0s and 1s, which is how the computer stores data on the hard disk, for example. I also need to be able to convert the 0/1 data back into the original file type. Can anyone guide me on how to do this? Any scripting language is OK, but I prefer Python or C.
A PDF, DOCX or JPG file is already represented in binary (ones and zeros) when stored on your hard disk.
Hence no conversion is necessary.
If you mean something different (for example, conversion to a form that displays on the screen as ones and zeros) then you will need to articulate your question more clearly.
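If the second reading is what is wanted (a visible string of ones and zeros, plus a way back), a minimal Python sketch could look like the following. The function names are made up for illustration, and it assumes the whole file fits in memory; the text output is eight times the size of the original.

```python
from pathlib import Path

def bytes_to_bits(data: bytes) -> str:
    """Render each byte as eight '0'/'1' characters, most significant bit first."""
    return "".join(f"{byte:08b}" for byte in data)

def bits_to_bytes(bits: str) -> bytes:
    """Inverse of bytes_to_bits: pack every group of eight characters into one byte."""
    if len(bits) % 8 != 0:
        raise ValueError("bit string length must be a multiple of 8")
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

def file_to_bits(src: str, dst: str) -> None:
    """Write the contents of any file (PDF, DOCX, JPG, ...) as a text file of 0s and 1s."""
    Path(dst).write_text(bytes_to_bits(Path(src).read_bytes()), encoding="ascii")

def bits_to_file(src: str, dst: str) -> None:
    """Rebuild the original file from a text file of 0s and 1s."""
    bits = Path(src).read_text(encoding="ascii").strip()
    Path(dst).write_bytes(bits_to_bytes(bits))
```

The file type plays no role here: the round trip returns exactly the bytes that went in, so the rebuilt file opens like the original as long as it gets the same extension.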
Related
I use MATLAB for calculating and plotting. I want to write a plot as an image file such as PNG or JPG into a MySQL database (so that I can retrieve it later for a web browser). In other words, I want to write a blob to the database that is a PNG or JPG file.
If I search for that I get http://www.mathworks.com/matlabcentral/answers/97768-how-do-i-insert-an-image-or-figure-into-a-database-using-the-database-toolbox-in-matlab but there a MATLAB matrix is written to the database as an array. That is much bigger than a compressed PNG file, does not show subplots and other figure elements, and cannot be displayed by a web browser.
A workaround would be to write the plot to a file and use MATLAB (or an external script tool based on Python or similar) to read that file and write it to the database as a blob.
Do you know a way to write a plot as PNG or JPG directly to a database without the detour via a file?
I also asked MATLAB support and they gave me a positive answer. A solution is the figToImStream function of the MATLAB Compiler toolbox: "Stream out figure as byte array encoded in format specified, creating signed byte array in .png format". The downside is that the MATLAB Compiler toolbox is quite expensive...
How do the following terms differ in the context of a file?
Binary Form and Binary File.
Well, all files are binary, but you can interpret their contents in various ways.
If you open a file in Notepad, and see the content:
Everything is good
Then you might think "this is a text file", but it is a text file only because you chose to open it in Notepad and Notepad was able to interpret the contents as characters and then display them to you and you could read it.
Binary Form might be a way to say that the data is not representable in a manner readable to us humans. For instance, saving an image to a file certainly produces the same types of bits as a text file does, but you could not open the file in Notepad or similar and expect to understand any of it.
To conclude, whatever "Binary Form" and "Binary File" means probably depends on the context, but here's my interpretation:
Binary Form: Non-readable form, i.e. not ordinary text, understandable only if you read it in through a computer program and render it
Binary File: A file containing data in binary form. All files are basically binary, consisting of 1's and 0's.
A text file is basically just a binary file that either carries with it something that identifies its contents as being of textual nature, or is by convention opened up in a program that will try to interpret it as text.
For instance, if a web server returns a file along with a MIME type that identifies the file as text, the browser might try to display it to you, whereas if the server returns a MIME type that identifies it as binary (i.e. not text), the browser would usually just download the file without trying to display it.
So "binary file" probably refers, in the context of whatever prompted your question, to the conventions that differentiate the behavior of the programs that deal with files. As I said, all files are basically binary; it's how you interpret their content that is important.
All files are binary, but I might (for a given purpose) think of the data in binary form or in the form of the characters that it (if it contained text) represented. Hence one may consider the same file as containing "Hello World" or {0x48,0x65,0x6C,0x6C,0x6F,0x20,0x57,0x6F,0x72,0x6C,0x64} depending on what we were doing with it.
A file intended for use solely in the latter way (e.g. an executable or most image formats) would generally be referred to as a binary file.
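The two views of the same data can be reproduced in a few lines of Python. This is only an illustration of the "Hello World" example above, and it assumes ASCII encoding.

```python
# The same eleven bytes, seen once as text and once as raw byte values.
data = "Hello World".encode("ascii")

# "Binary" view: the numeric value of every byte, written in hexadecimal.
as_hex = ["0x%02X" % byte for byte in data]

# "Text" view: the characters those byte values stand for in ASCII.
as_text = data.decode("ascii")

print(as_hex)
print(as_text)
```

Nothing is converted between the two views; only the interpretation of the bytes changes.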
Different text-file conventions can be sensibly converted between systems. For example, a transfer may translate between new lines being represented by {0x0A}, {0x0D}, {0x0D,0x0A} or {0x1E} (and a few other formats, but they have greater incompatibilities in other ways) so that the files work correctly on whatever system they are moved to. Doing this to an image file or an executable will ruin it, however. Hence we talk about transferring files as text (translate the line endings) or as binary (don't change anything).
One might say "binary form" to refer to some non-text representation of data. It's a very vague term. Likewise, a "binary file" is just a file that doesn't contain text.
Imagine you want to store the number "123" in a file. There are several ways you might do it, but broadly, there are just two: text or binary. In text form, the number "123" would be represented as a code for the digit "1", a code for the digit "2", and a code for the digit "3". There's nothing very different between this and a file containing the string "abc": three codes for three characters.
But in a binary file, the number "123" would probably be stored as a single "code" -- the base-2 representation of the number itself. Not the characters we use to display the number, but the actual value of the number, if you understand what I mean.
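As a small illustration of the two representations, here is a Python sketch. The choice of a 4-byte little-endian unsigned integer for the binary form is an assumption made for the example; a real file format might use a different width or byte order.

```python
import struct

# Text form: one ASCII code per digit character ('1', '2', '3').
as_text = "123".encode("ascii")       # bytes 0x31 0x32 0x33

# Binary form: the value 123 itself, packed as a 4-byte little-endian unsigned int.
as_binary = struct.pack("<I", 123)    # bytes 0x7B 0x00 0x00 0x00

# Reading each form back requires knowing which representation was used.
from_text = int(as_text.decode("ascii"))
from_binary = struct.unpack("<I", as_binary)[0]
```

Both `from_text` and `from_binary` come back as the integer 123, but the bytes on disk are completely different.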
At my place of work we have a legacy document management system that for various reasons is now unsupported by the developers. I have been asked to look into extracting the documents contained in this system to eventually be imported into a new 3rd party system.
From tracing and process monitoring I have determined that the document images (mainly tiff files) are stored in a number of 1.5GB files. These files seem to be read from a specific offset and then written to a tmp file that is then served via a web app to the client, and then deleted.
I guess I am looking for suggestions as to how I can inspect these large files that contain the tiff images, and eventually extract and write them to individual files.
Are the TIFFs compressed in some way? If not, then your job may be pretty easy: stitch the TIFFs together from the 1.5G files.
Can you see the output of a particular 1.5G file (or series of them)? If so, then you should be able to piece together what the bytes should look like for that TIFF if it were uncompressed.
If the bytes don't appear to be there, then try some standard compressions (zip, tar, etc.) to see if you get a match.
I'd open a file, seek to the required offset, and then stream into a tiff object (ideally one that supports streaming from memory or file). Then you've got it. Poke around at some of the other bits, as there's likely metadata about the document that may be useful to the next system.
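A rough Python sketch of the seek-and-read step follows. It assumes the offset and length of one image are already known (for example from the tracing mentioned in the question); the function names are hypothetical, and nothing here is specific to the actual document management system. The only format knowledge used is that a TIFF file starts with the bytes "II*\0" (little-endian) or "MM\0*" (big-endian).

```python
TIFF_MAGICS = (b"II*\x00", b"MM\x00*")  # little-endian and big-endian TIFF headers

def read_span(container_path, offset, length):
    """Read `length` bytes starting at byte `offset` of the container file."""
    with open(container_path, "rb") as f:
        f.seek(offset)
        return f.read(length)

def looks_like_tiff(blob):
    """True if the blob begins with one of the two TIFF header signatures."""
    return blob[:4] in TIFF_MAGICS

def extract_to_file(container_path, offset, length, out_path):
    """Copy one span out of the container, refusing data that is not a TIFF header."""
    blob = read_span(container_path, offset, length)
    if not looks_like_tiff(blob):
        raise ValueError("no TIFF signature at offset %d" % offset)
    with open(out_path, "wb") as out:
        out.write(blob)
```

If the signature check fails at offsets that the application is known to read from, that is a hint that the stored images are compressed or wrapped in some container-specific header.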
I have a binary file that I'm trying to write to. However, I don't have the file format specification, nor have I found it using Google. I've been looking at the file in a hex editor, but so far that has only given me a headache. Is there a better way to decipher the format of the file so that I can append data to it?
File carving tools such as scalpel won't really help here. They're made for extracting files with known header and/or footer signatures from a memory dump or some larger, composite file.
For your scenario, I would recommend a hex editor with templating capability, like the 010 Editor. This will allow you to name and annotate "fields" in the binary as you learn more about what each part of the file does. Unfortunately, the process of finding out what each field does is mostly manual. As a methodology, just start playing with it. Change some values in your current binary and see what happens. Expect to spend significant time on it, but also enjoy the process!
You may want to search it with an open-source forensic application like foremost or scalpel. They will do most of the grunt work for you; you just likely won't learn anything.
I have a binary file. I don't know how it's formatted; I only know it comes from Delphi code.
Is there any way to analyze a binary file?
Is there any "pattern" for analyzing and deserializing the binary content of a file with an unknown format?
Try these:
Deserialize data: analyze how your exe was compiled (try File Analyzer). Try to deserialize the binary data with the language discovered. Then serialize it in an XML format (language-independent) that every programming language can understand.
Analyze the binary data: try to save various versions of the file with small variations and use a diff program, together with a hex editor, to work out the meaning of every bit. Use it in conjunction with binary hacking techniques (like How to crack a Binary File Format by Frans Faase).
Reverse engineer the application: try getting the code using reverse engineering tools for the programming language used to build the app (found with File Analyzer). Otherwise use a disassembler analysis tool like IDA Pro Disassembler.
For my hobby project I had to reverse engineer some old game files. My approaches were:
Have a good hex editor.
Look for readable words in the binary file. Note how they are distributed. If the distance between them is constant, you know it is a listing.
Look for 2-3 consecutive zero bytes. They might indicate an int32 value (a small number stored in four bytes leaves the upper bytes zero).
Some dwords might be pointers into the file.
Try to identify reoccurring patterns in the file.
Seeing lots of C0-CF might indicate RLE compressed data.
I've developed Hexinator (Windows & Linux) and Synalyze It! (macOS) exactly for this purpose. These applications allow you to see the binary files like in other hex editors but additionally you can create a "grammar" with the specifics of a binary file format. The grammar contains all the building blocks and is used to parse the file automatically.
Thus you can keep the knowledge you gain in the analysis and apply it to multiple files simultaneously. You can also color-code the bits and pieces of file formats for a quick overview in the hex editor.
The parsing results are displayed in a tree view where you can also modify the files easily (applying endianness et cetera).
Reverse engineering a binary file when you have some idea of what it represents is a very time consuming process. If you have no idea what it is then it will be even harder.
It is possible though, but you have to have a pretty good reason for doing so.
The first step would be to open it up in a hex editor of your choice and see if you can find any English text to point you in the direction of what the file is even supposed to represent. From there, Google "Reverse Engineering binary files"; there are people much more knowledgeable than me who have written guides about it.
The "strings" program from GNU binutils is very useful. It will print the strings of printable characters in a file, quite often giving a clue to what a file contains or a program does.
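Where the "strings" program is not available, its basic idea can be approximated in Python. This is a simplified sketch, not a reimplementation of the GNU tool: it only looks for runs of printable ASCII bytes, and the minimum run length of 4 is chosen to match the usual default of "strings".

```python
import re

def printable_strings(data: bytes, min_len: int = 4):
    """Return every run of at least `min_len` printable ASCII bytes found in `data`."""
    # 0x20-0x7E covers space through '~', i.e. the printable ASCII range.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]

def strings_in_file(path: str, min_len: int = 4):
    """Convenience wrapper that reads the whole file into memory first."""
    with open(path, "rb") as f:
        return printable_strings(f.read(), min_len)
```

For example, `printable_strings(b"\x00\x01TForm1\x00ab\x00Hello World\xff")` picks out "TForm1" and "Hello World" and skips "ab", which is shorter than the minimum length.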
If the data represents serialized Delphi objects, you should start reading about the Delphi serialization process. If that's the case, I think your best bet would be to load it using Delphi and continue your analysis from the IDE. Some information about Delphi serialization can be found here.
EDIT: if the file does contain serialized Delphi objects, then you should write a small Delphi program that loads it and "convert" the data yourself to something neutral, like XML. If you manage to do this, you should check and see if Delphi supports serializing to XML. Then you could access those objects from any language.
The Unix "file" command is really useful - I don't know if there is anything like it in Windows. You run it like this:
file myfile.ext
And it spits out a text description based on the magic numbers and data contained therein.
It is probably included with Cygwin.
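The magic-number idea behind "file" can be sketched in Python for a handful of well-known formats. The table below is a tiny, hand-picked assumption for illustration; the real "file" command consults a far larger database and also inspects data beyond the first bytes.

```python
# A few well-known file signatures ("magic numbers") and what they indicate.
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"PK\x03\x04", "ZIP archive (also DOCX, XLSX, JAR)"),
    (b"II*\x00", "TIFF image, little-endian"),
    (b"MM\x00*", "TIFF image, big-endian"),
]

def guess_type(header: bytes) -> str:
    """Match the first bytes of a file against the signature table."""
    for magic, description in SIGNATURES:
        if header.startswith(magic):
            return description
    return "unknown"

def guess_file_type(path: str) -> str:
    """Read only the first 16 bytes of the file, which is enough for this table."""
    with open(path, "rb") as f:
        return guess_type(f.read(16))
```

A result of "unknown" only means that none of these six signatures matched, not that the file has no recognizable format.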
If you have access to the application that creates the file, you can apply changes to the application, then save the file and see the effects (Keep in mind that numbers are probably stored in little endian):
First create the file repeatedly. If the files are not binary equal, the current date/time is probably stored in the area where the differences occur.
Maybe you want to repeat that with the software running under different environments, to see if OS version etc are stored, but this is rather unusual.
Next you can try to change single variables and create several files that only differ in the value of this variable. This helps you identify where this variable is stored.
That way you can also exclude variables that are not stored in the file: If you change them, but the files created are identical, they are not stored.
In order to test the hypotheses you worked out with the steps above, edit one of the files and have the application read it.
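The comparison step described above can be automated with a short Python sketch. The function name is hypothetical; it compares two files byte by byte and reports the offset ranges that differ, and it deliberately ignores any extra bytes when one file is longer than the other.

```python
def diff_ranges(a: bytes, b: bytes):
    """Return (start, end) offset pairs, end exclusive, where a and b differ.

    Only the common prefix length is compared; a length difference
    between the two inputs is not reported.
    """
    ranges = []
    start = None
    common = min(len(a), len(b))
    for i in range(common):
        if a[i] != b[i]:
            if start is None:
                start = i          # a new run of differing bytes begins here
        elif start is not None:
            ranges.append((start, i))
            start = None
    if start is not None:
        ranges.append((start, common))
    return ranges
```

Usage: read two saves that differ in a single setting with `open(path, "rb").read()` and pass both to `diff_ranges`. A one-byte range that changes when a small number is changed is a good candidate for the low byte of a little-endian integer.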
If you don't have access to the application itself, I suggest that you forget about it and find another way to solve your problem. There is a very high probability that it will be faster...
If file does not give a meaningful answer, you may want to try TRiD by Marco Pontello to determine whether your data is stored in a known format.
Get the Delphi application and open it in IDA Pro freeware version, and find where it writes the file, and decode how it writes the file that way.
Unless it's plain text.
Do you know the program that uses it? If so, you can hook that program's write-to-file function and get an idea of what data it's writing, the size of the data and where.
More Info: http://www.codeproject.com/KB/DLL/Win32APIHooking_Trouble.aspx
Unlike traditional hex editors which only display the raw hex bytes of a file, 010 Editor can also parse a file into a hierarchical structure using a Binary Template. The results of running a Binary Template are much easier to understand and edit than using just the raw hex bytes.
http://www.sweetscape.com/010editor/
Try to open it in a hex editor and analyse.