Confusion about binary terms

How are the following terms different in the context of a file?
Binary Form and Binary File.

Well, all files are binary, but you can interpret their contents in various ways.
If you open a file in Notepad, and see the content:
Everything is good
Then you might think "this is a text file", but it is a text file only because you chose to open it in Notepad, and Notepad was able to interpret the contents as characters, display them to you, and let you read them.
Binary Form might be a way to say that the data is not representable in a readable manner to us humans, for instance saving an image to a file certainly produces the same types of bits as a text file does, but you could not open the file in Notepad or similar and expect to understand any of it.
To conclude, whatever "Binary Form" and "Binary File" means probably depends on the context, but here's my interpretation:
Binary Form: a non-readable form, i.e. not ordinary text, understandable only if a computer program reads it in and renders it
Binary File: A file containing data in binary form. All files are basically binary, consisting of 1's and 0's.
A text file is basically just a binary file that either carries with it something that identifies its contents as being of textual nature, or is by convention opened up in a program that will try to interpret it as text.
For instance, if a web server returns a file along with a MIME type that identifies the file as text, the browser might try to display it to you, whereas if the server returns a MIME type that identifies it as binary (i.e. not text), the browser would usually just download the file without trying to display it.
So "binary file", in the context of whatever prompted your question, probably refers to the conventions that differentiate the behavior of the programs that deal with files. As I said, all files are basically binary; it's how you interpret their content that is important.

All files are binary, but I might (for a given purpose) think of the data in binary form or in the form of the characters it represents (if it contains text). Hence one may consider the same file as containing "Hello World" or {0x48,0x65,0x6C,0x6C,0x6F,0x20,0x57,0x6F,0x72,0x6C,0x64}, depending on what we are doing with it.
A file intended for use solely in the latter way (e.g. an executable or most image formats) would generally be referred to as a binary file.
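To make the two views concrete, here is a small Python sketch (the file name is just an example) that writes those eleven bytes and then reads them back both as text and as raw bytes:

# Sketch: the same file viewed as text and as raw bytes.
with open("hello.txt", "w", encoding="ascii") as f:
    f.write("Hello World")

with open("hello.txt", "r", encoding="ascii") as f:
    print(f.read())                     # Hello World

with open("hello.txt", "rb") as f:
    print([hex(b) for b in f.read()])   # ['0x48', '0x65', '0x6c', ..., '0x64']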
Different line-ending conventions for text files can sensibly be converted between systems. For example, a transfer may translate between new lines represented by {0x0A}, {0x0D}, {0x0D,0x0A} or {0x1E} (there are a few other formats, but they have greater incompatibilities in other ways) so that the files work correctly on whatever system they are moved to. Doing this to an image file or an executable, however, will ruin it. Hence we talk about transferring files as text (do the translation between line endings) or as binary (don't change anything).
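A minimal Python sketch of that distinction, with invented file names:

# "Transfer as text": rewrite line endings for the target system.
unix_text = b"first line\nsecond line\n"
dos_text = unix_text.replace(b"\n", b"\r\n")   # LF -> CRLF for a DOS/Windows target

# "Transfer as binary": copy the bytes untouched; translating them would ruin an image or executable.
with open("photo.jpg", "rb") as src, open("photo_copy.jpg", "wb") as dst:
    dst.write(src.read())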

One might say "binary form" to refer to some non-text representation of data. It's a very vague term. Likewise, a "binary file" is just a file that doesn't contain text.
Imagine you want to store the number "123" in a file. There are several ways you might do it, but broadly there are just two: text or binary. In text form, the number "123" would be represented as a code for the digit "1", a code for the digit "2", and a code for the digit "3". There's nothing very different between this and a file containing the string "abc": three codes for three characters.
But in a binary file, the number "123" would probably be stored as a single "code" -- the base-2 representation of the number itself. Not the characters we use to display the number, but the actual value of the number, if you understand what I mean.
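As a hedged illustration (file names are invented), here is a Python sketch that stores 123 both ways:

import struct

with open("number.txt", "w", encoding="ascii") as f:
    f.write("123")                    # text form: bytes 0x31 0x32 0x33

with open("number.bin", "wb") as f:
    f.write(struct.pack("<i", 123))   # binary form: bytes 0x7B 0x00 0x00 0x00

with open("number.bin", "rb") as f:
    print(f.read())                   # b'{\x00\x00\x00'  (0x7B is 123)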

Related

Convert huge linked data dumps (RDF/XML, JSON-LD, TTL) to TSV/CSV

Linked data collections are usually given in RDF/XML, JSON-LD, or TTL format. Relatively large data dumps seem fairly difficult to process. What is a good way to convert an RDF/XML file to a TSV of triplets of linked data?
I've tried OpenRefine, which should handle this, but a 10 GB file (e.g. the person authority data from the German National Library) is too difficult to process even on a laptop with decent processing power.
Looking for software recommendations or some e.g. Python/R code to convert it. Thanks!
Try these:
Lobid GND API
http://lobid.org/gnd/api
Supports OpenRefine (see blogpost) and a variety of other queries. The data is hosted as JSON-LD (see context) in an elasticsearch cluster. The service offers a rich HTTP-API.
Use a Triple Store
Load the data to a triple store of your choice, e.g. rdf4j. Many triple stores provide some sort of CSV serialization. Together with SPARQL this could be worth a try.
Catmandu
http://librecat.org/Catmandu/
A powerful Perl-based data toolkit that comes with a useful collection of ready-to-use transformation pipelines.
Metafacture
https://github.com/metafacture/metafacture-core/wiki
A Java toolkit for designing transformation pipelines.
You could use the ontology editor Protégé: there you can query the data with SPARQL according to your needs and save the results as a TSV file. It may be important, however, to configure the software beforehand so that such amounts of data remain manageable.
Canonical N-Triples may already be what you are after, as it is essentially a space-separated, line-based format for RDF (you cannot naively split at spaces though, as you need to take care of literals; see below). Of the dataset you cited, many files are available as N-Triples. If not, use a parsing tool like rapper for the conversion to N-Triples, e.g.
rapper -i turtle -o ntriples rdf-file-in-turtle-format.ttl > rdf-file-in-ntriples-format.nt
Typically, N-Triples exporters do not exploit everything the specification allows regarding whitespace and emit canonical N-Triples. Hence, given a line in a canonical N-Triples file such as:
<http://example.org/s> <http://example.org/p> "a literal" .
you can get CSV by replacing the first and the second space character of each line with a comma and removing everything after and including the last space character. As literals are the only RDF term in which spaces are allowed, and as literals are only allowed in object position, this should work for canonical N-Triples.
You can get TSV by replacing those space characters with tabs. If you also do that for the last space character and do not remove the dot, you have a file that is both valid N-Triples and valid TSV. If you simply take these positions as split positions, you can work with canonical N-Triples files without converting to CSV/TSV at all (a small sketch follows below).
Note that you may have to deal with commas/tabs inside the RDF terms (e.g. by escaping), but that problem exists in any solution that turns RDF into CSV/TSV.
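As a rough illustration of the splitting rule above, here is a hedged Python sketch that streams canonical N-Triples from stdin to TSV on stdout. It assumes canonical output (one triple per line) and, as noted above, does not escape tabs inside literals:

import sys

def ntriples_to_tsv(src, dst):
    # Split each line at its first two spaces and drop the trailing " ."
    # Literals may contain spaces, so we split at most twice.
    for line in src:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        subject, predicate, obj = line.split(" ", 2)
        if obj.endswith(" ."):
            obj = obj[:-2]
        dst.write(f"{subject}\t{predicate}\t{obj}\n")

if __name__ == "__main__":
    ntriples_to_tsv(sys.stdin, sys.stdout)

Saved as, say, nt2tsv.py (the name is made up), it can be run as python nt2tsv.py < data.nt > data.tsv; because it works line by line, even multi-gigabyte dumps are processed in constant memory.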

.pdf to binary and binary to .pdf

For a project I need to convert my .pdf, .docx or .jpg file into a binary file consisting of 0s and 1s. This is the way the computer saves data on the hard drive, for example. I also need to be able to turn the 0/1 information back into the aforementioned file type. Can anyone guide me on how to do this? Any scripting language is OK, but I prefer Python or C.
A .pdf, .docx or .jpg file is already represented in binary (ones and zeros) when stored on your hard disc.
Hence no conversion is necessary.
If you mean something different (for example, conversion to a form that displays on the screen as ones and zeros) then you will need to articulate your question more clearly.
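If the goal really is a literal string of '0' and '1' characters that can be turned back into the original file, a minimal Python sketch (file names are placeholders) could look like this:

def file_to_bits(path):
    # Read the file's bytes and render each byte as eight '0'/'1' characters.
    with open(path, "rb") as f:
        return "".join(format(byte, "08b") for byte in f.read())

def bits_to_file(bits, path):
    # Group the characters back into bytes and write them out unchanged.
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    with open(path, "wb") as f:
        f.write(data)

bits = file_to_bits("input.pdf")
bits_to_file(bits, "restored.pdf")   # restored.pdf is byte-for-byte identical to input.pdf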

How can I tell TortoiseHg to display a UTF-16 file as non-binary?

In a Microsoft Access 2007 project, the Access form objects are exported to files with dedicated software using the built-in function "SaveAsText". This is necessary because Access doesn't store any of its code modules in isolated files on its own.
The file starts with the bytes "FF FE" (which is the UTF-16 byte order mark according to http://de.wikipedia.org/wiki/Byte_Order_Mark). I presume that, because of the many NUL characters in this file, Hg treats it as a binary file. Hence the diff pane in the TortoiseHg Workbench always says
File or diffs not displayed: File is binary.
which is quite understandable under this assumption. Nevertheless, this file is just ordinary source code; I can view it in Windows' Notepad, for example, without any problems.
Is there any way to tell Mercurial, that this particular file should be treated as text, not binary?
Edit:
In addition to the accepted answer below, I decided not to change the saving behaviour, but to use the "Visual Diff" command (select the file, then press Ctrl+D) instead.
I'm guessing that you frequently or occasionally export the form objects in order to track source code changes.
The only way to convince Mercurial that a file is not binary is to avoid NUL bytes.
You may want to convert the source code files to ASCII (or maybe ANSI) encoding as an additional step in your export in order to avoid the NUL bytes. If the source code files contain Unicode characters, you might try UTF-8 instead, as it uses multi-byte sequences only when necessary and single bytes otherwise, again avoiding NUL bytes. I tried it out briefly and Mercurial handles UTF-8: it doesn't show "File is binary" but the actual diff. I committed on the command line but viewed the diff in TortoiseHg. I have a link about command-line encoding challenges below.
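As a hedged sketch of such an export step (the file name is invented, and the SaveAsText output is assumed to be UTF-16 with the FF FE BOM described in the question), the re-encoding could look like this in Python:

from pathlib import Path

def utf16_to_utf8(path):
    # Read the exported file as UTF-16 (the BOM is consumed automatically)
    # and write it back as UTF-8 so Mercurial no longer sees NUL bytes.
    p = Path(path)
    text = p.read_text(encoding="utf-16")
    p.write_text(text, encoding="utf-8")

utf16_to_utf8("Form_Main.txt")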
The hgrc encode/decode sections might be particularly useful in helping to filter the UTF-16 files into something that works better.
A couple other pages on Mercurial and encoding:
Character Encoding On Windows
Encoding Strategy
TortoiseHg 2.1 + Mercurial 1.9
From https://www.mercurial-scm.org/wiki/BinaryFiles:
The question naturally arises, what is a binary file anyway? It turns out there's really no good answer to this question, so Mercurial uses the same heuristic that programs like diff(1) use. The test is simply if there are any NUL bytes in a file.
For diff, export, and annotate, this will get things right almost all of the time and it will not attempt to process files it thinks are binary. If necessary, you can force these commands to treat files as text with -a.
This didn't exist at the time the question was asked, but now there's the msaccess-vcs-integration project, which exports/imports MS Access objects so that they can be version controlled.
Quote from the project's readme:
Encoding
For Access objects which are normally exported in UCS-2-little-endian encoding, the included module automatically converts the source code to and from UTF-8 encoding during export/import; this is to ensure that you don't have trouble branching, merging, and comparing in tools such as Mercurial which treat any file containing 0x00 bytes as a non-diffable binary file.
If you export your forms and modules with this instead of directly using Access's SaveAsText function, Mercurial will not treat the files as binary.

Should HTML be encoded before being persisted?

Should HTML be encoded before being stored in say, a database? Or is it normal practice to encode on its way out to the browser?
Should all my text-based field lengths be quadrupled in the database to allow for the extra storage?
Looking for best practice rather than a solid yes or no :-)
Is the data in your database really HTML or is it application data like a name or a comment that you just happen to know will end up as part of an HTML page?
If it's application data, I think it's best to:
represent it in a form that's native to the environment (e.g. unencoded in the database), and
make sure it's properly translated as it crosses representational boundaries (encode when you generate the HTML page).
If you're a fan of MVC, this also helps separate the view/controller from the model (and from the persistent storage format).
Representation
For example, assume someone leaves the comment "I love M&Ms". It's probably easiest to represent it in the code as the plain-text string "I love M&Ms", not as the HTML-encoded string "I love M&amp;Ms". Technically, the data as it exists in the code is not HTML yet, and life is easiest if the data is represented as simply and accurately as possible. This data may later be used in a different view, e.g. a desktop app. This data may be stored in a database, a flat file, or an XML file, and perhaps later be shared with another program. It's simplest for the other program to assume the string is in the "native" representation for the format: "I love M&Ms" in a database and flat file, and "I love M&amp;Ms" in the XML file. I would cringe to see the HTML-encoded value encoded again in an XML file ("I love M&amp;amp;Ms").
Translation
Later, when the data is about to cross a representation boundary (e.g. displayed in HTML, stored in a database, a plain-text file, or an XML file), it's important to make sure it is properly translated so it is represented accurately in the format native to that next environment. In short, when you go to display it on an HTML page, make sure it is translated into properly encoded HTML (manually or through a tool) so the value is accurately displayed on the page. When you go to store it in the database or use it in a query, use escaping and/or prepared statements and bound variables to ensure the same conceptual value is accurately represented to the database. When you go to store it in an XML file, ensure it is XML-encoded.
Failure to translate properly when crossing representation boundaries is the source of injection attacks such as SQL injection. Be conscientious of that whenever you are working with multiple representations/languages (e.g. Java, SQL, HTML, JavaScript, XML, etc.).
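To make the boundary idea concrete, here is a hedged Python sketch (the table and comment text are invented) that stores the raw value, lets a bound variable handle the database boundary, and escapes only when producing HTML:

import html
import sqlite3

comment = "I love M&Ms <3"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (body TEXT)")
conn.execute("INSERT INTO comments (body) VALUES (?)", (comment,))   # bound variable, value stored unencoded

(stored,) = conn.execute("SELECT body FROM comments").fetchone()
print("<p>" + html.escape(stored) + "</p>")                          # <p>I love M&amp;Ms &lt;3</p>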
--
On the other hand, if you are really trying to save HTML page fragments to the database, then I am unclear what you mean by "encoded before being stored". If it is strictly valid HTML, all the necessary values should already be encoded (e.g. &amp;, &lt;, etc.).
The practice is to HTML encode before display.
If you are consistent about encoding before displaying, you have done a good bit of XSS prevention.
You should save the original form in your database. This preserves the original, and you may want to do other processing on it rather than on the encoded version.
Database vendor specific escaping on the input, html escaping on the output.
I disagree with everyone who thinks it should be decoded at display time: if it's encoded before it reaches the database, an attack is only possible if a developer purposely decodes it before displaying it. However, if you decode it before presenting it, there is always a chance that this could happen through some other newbie developer, like a new hire, or a bad implementation. If it's sitting there unencoded it's just waiting to pop out on the internet and spread like herpes. Losing the original data shouldn't be a concern: encode + decode should produce the same data every time. Just my two cents.
For security reasons, yes, you should first convert the HTML to its entities and then insert it into the database. Attacks such as XSS are initiated when you allow users (or rather, bad guys) to use HTML tags and then process/insert them into the database. XSS is one of the root causes of most security holes. So you definitely need to encode your HTML before storing it.

How to analyze binary file?

I have a binary file. I don't know how it's formatted; I only know it comes from Delphi code.
Is there any way to analyze a binary file?
Is there any "pattern" for analyzing and deserializing the binary content of a file with an unknown format?
Try these:
Deserialize the data: analyze how your exe was compiled (try File Analyzer). Try to deserialize the binary data with the language discovered. Then serialize it in an XML format (language-independent) that every programming language can understand.
Analyze the binary data: try to save various versions of the file with small variations and use a diff program plus a hex editor to work out the meaning of every bit. Use this in conjunction with binary hacking techniques (like How to Crack a Binary File Format by Frans Faase); a byte-level comparison sketch follows this list.
Reverse engineer the application: try to recover code using reverse-engineering tools for the programming language used to build the app (found with File Analyzer). Otherwise use a disassembler analysis tool like IDA Pro.
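As a rough sketch of the diff approach from the second point above (file names are examples), comparing two saved versions byte by byte in Python might look like:

def diff_offsets(path_a, path_b):
    # Return the offsets at which the two versions differ.
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        a, b = fa.read(), fb.read()
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

print(diff_offsets("save_v1.bin", "save_v2.bin"))   # e.g. [132, 133] might point at a 16-bit field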
For my hobby project I had to reverse engineer some old game files. My approaches were:
Have a good hex editor.
Look for readable words in the binary file and note how they are distributed. If the distance between them is constant, you know it is a listing (a small string-scanning sketch follows this list).
Look for 2-3 consecutive zero bytes; they might indicate an int32 value.
Some dwords might be pointers into the file.
Try to identify recurring patterns in the file.
Seeing lots of bytes in the C0-CF range might indicate RLE-compressed data.
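As a rough companion to the tips above (the file name is an example), here is a small Python sketch that lists printable ASCII runs and their offsets, similar in spirit to a "strings" tool:

import re

def find_strings(path, min_len=4):
    # Print the offset and content of every run of printable ASCII characters.
    with open(path, "rb") as f:
        data = f.read()
    for match in re.finditer(rb"[\x20-\x7e]{%d,}" % min_len, data):
        print(f"{match.start():08x}  {match.group().decode('ascii')}")

find_strings("unknown.dat")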
I've developed Hexinator (Windows & Linux) and Synalyze It! (macOS) exactly for this purpose. These applications let you view binary files as other hex editors do, but additionally you can create a "grammar" describing the specifics of a binary file format. The grammar contains all the building blocks and is used to parse the file automatically.
Thus you can keep the knowledge you gain in the analysis and apply it to multiple files simultaneously. You can also color-code the bits and pieces of file formats for a quick overview in the hex editor.
The parsing results are displayed in a tree view where you can also modify the files easily (applying endianness et cetera).
Reverse engineering a binary file when you have some idea of what it represents is a very time-consuming process. If you have no idea what it is, it will be even harder.
It is possible though, but you have to have a pretty good reason for doing so.
The first step would be to open it up in a hex editor of your choice and see if you can find any English text to point you in the direction of what the file is even supposed to represent. From there, Google "reverse engineering binary files"; there are much more knowledgeable people than me who have written guides about it.
The "strings" program from GNU binutils is very useful. It will print the strings of printable characters in a file, quite often giving a clue to what a file contains or a program does.
If the data represents serialized Delphi objects, you should start reading about the Delphi serialization process. If that's the case, I think your best bet would be to load it using Delphi and continue your analysis from the IDE. Some information about Delphi serialization can be found here.
EDIT: if the file does contain serialized Delphi objects, then you should write a small Delphi program that loads it and "convert" the data yourself to something neutral, like XML. If you manage to do this, you should check whether Delphi supports serializing to XML. Then you could access those objects from any language.
The Unix "file" command is really useful - I don't know if there is anything like it in Windows. You run it like this:
file myfile.ext
And it spits out a text description based on the magic numbers and data contained therein.
It is probably included in Cygwin.
If you have access to the application that creates the file, you can apply changes to the application, then save the file and see the effects (Keep in mind that numbers are probably stored in little endian):
First, create the file repeatedly. If the files are not binary-identical, the current date/time is probably stored in the area where the differences occur.
Maybe you want to repeat that with the software running under different environments, to see if OS version etc are stored, but this is rather unusual.
Next you can try to change single variables and create several files that only differ in the value of this variable. This helps you identify where this variable is stored.
That way you can also exclude variables that are not stored in the file: If you change them, but the files created are identical, they are not stored.
In order to test the hypotheses you worked out with the steps above, edit one of the files and have the application read it.
If you don't have access to the application itself, I suggest that you forget about it and find another way to solve your problem. There is a very high probability that it will be faster...
If file does not give a meaningful answer, you may want to try TrID by Marco Pontello to determine whether your data is stored in a known format.
Get the Delphi application and open it in the IDA Pro freeware version, find where it writes the file, and decode how it writes the file that way.
Unless it's plain text.
Do you know the program that uses it? If so, you can hook that program's write-to-file function and get an idea of what data it's writing, the size of the data, and where.
More Info: http://www.codeproject.com/KB/DLL/Win32APIHooking_Trouble.aspx
Unlike traditional hex editors which only display the raw hex bytes of a file, 010 Editor can also parse a file into a hierarchical structure using a Binary Template. The results of running a Binary Template are much easier to understand and edit than using just the raw hex bytes.
http://www.sweetscape.com/010editor/
Try to open it in a hex editor and analyse it.