Data loss when saving an image's binary data 'as file'

I'm kind of a programming newbie, but here it goes:
I opened an image file with the program BinaryViewer (http://www.proxoft.com/BinaryViewer.aspx) to see its binary code.
Then I used its copy function to save the binary data first as a .txt file, then as a .jpeg file. The resulting files are much smaller than the original image file and are not readable as images at all.
Why are the resulting images so much smaller? What kind of data is getting lost in this process, and are there ways to prevent that?
Are there specific ways to recreate the image from a file containing only the 0s and 1s of an original image file?

Whatever binary viewer you are using, it just looks at the raw bytes as stored in the file on disk.
1) When saving 'as text', the viewer itself determines in which format it writes the binary information to a text file. You should look that up in its documentation.
2) It is very unlikely that it has any knowledge of the structure of JPEG files. So again, when you save to a .jpg file, the viewer chooses how to output the bytes and dumps them into a file with a .jpg extension, but the result does not have the on-disk structure of a JPEG. For any image viewer trying to read the file, it's just garbage.
But as I said in my comments, without knowing which binary viewer you are talking about, it's not possible to be more specific.
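That said, to answer the last part of the question: whatever text format the viewer uses, decoding it back is mechanical once you know it. For example, if the dump really is just the literal '0' and '1' characters of the original bytes, a minimal Python sketch (hypothetical file names) would be:

    # Rebuild a binary file from a text dump of literal '0'/'1' characters.
    # Assumes the dump contains only ASCII '0' and '1' (whitespace ignored)
    # and that the bit stream is a whole number of bytes, MSB first.
    with open("dump.txt", "r") as f:
        bits = "".join(f.read().split())

    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

    with open("restored.jpg", "wb") as f:
        f.write(data)

If the viewer exported hex or some other text representation instead, the decoding step changes accordingly, but the principle is the same.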

Related

How to retrieve original pdf stored as MySQL mediumblob?

A table containing almost four thousand records includes a mediumblob field for each record that contains the record's associated PDF report. Under both MySQL Workbench and phpMyAdmin the relevant DOCUMENT column displays the data as a BLOB button or link. In the case of phpMyAdmin the link also indicates the size of the data the Blob contains.
The issue is that when the BLOB button/link is clicked, MySQL Workbench's SQL Editor only displays the raw BLOB data, and under phpMyAdmin the link only allows the BLOB data to be saved as a .bin file instead of displaying or saving the data as a viewable PDF file. All previous attempts to retrieve the original PDFs using PHP have failed - see the related earlier thread: Extract Pdf from MySql Dump Saved as Text.
The filename field in the table shows that all the stored files are PDF files. Further research and tests indicate that the mediumblob data has been stored as application/octet-streams.
My question is how can the original PDFs be retrieved as readable PDFs? Is it possible for a .bin file saved from the database to be converted or used to recover the original PDF file?
Any assistance would be greatly appreciated.
In line with my assumption and Isaac's suggestion, the only solution was to speak to one of the software developers. It transpires that the documents were zipped using a third-party library, and that the header was removed, before being stored in the database.
The third-party library used is version 2.0.50727 of Chilkat, available from www.chilkatsoft.com. That version no longer appears to be available, but hopefully at least one of the later versions may do the job.
Thanks again for everyone's input and assistance.
Based on the discussion in the comments, it sounds like you'll need to either refer to the original source code or consult with the original developer to determine exactly how the data was stored.
Using phpMyAdmin to download the mediumblob data as a file will produce a .bin file in many cases. I actually don't recall how it determines the content type: for instance, a PNG file will download with a .png extension, but most other binary files, PDF included, simply download as a .bin when phpMyAdmin isn't sure what the extension should be. So the behavior you're seeing from phpMyAdmin is expected and correct, but since the .bin file doesn't work when it's renamed to .pdf, something has probably gone wrong with the import or upload.
BLOB data is usually stored in a fairly standardized way, but it seems your data doesn't follow that convention.
Without seeing the code directly, we can't know what exactly happened when the data was stored; we would only be guessing.
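If you want to experiment before getting a newer Chilkat version, a recovery attempt on a single dumped BLOB could look something like the sketch below. It assumes the compression is a standard deflate/zlib/gzip stream, which may not match what Chilkat actually produced, and report.bin is a hypothetical dump of one BLOB:

    import zlib

    # Try to recover a PDF from a BLOB dump that was compressed before
    # storage. Standard deflate variants only; Chilkat's actual output
    # (and the stripped header) may differ.
    with open("report.bin", "rb") as f:
        data = f.read()

    for wbits in (15, -15, 31):  # zlib stream, raw deflate, gzip
        try:
            out = zlib.decompress(data, wbits)
        except zlib.error:
            continue
        if out.startswith(b"%PDF-"):
            with open("report.pdf", "wb") as g:
                g.write(out)
            break

If the removed header is part of the compressed stream itself, you would also have to prepend it before decompressing, which is exactly the kind of detail only the original developers can confirm.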

How can I distinguish between a binary file and a text file using dot-net languages

I have a Visual Basic program that downloads individual files from the internet. Those files can be PDFs, or they can be actual webpages, or they can be text. Normally I don't run into any other type of file (except maybe images).
It might seem easy to know what type of file I'm downloading: just test the extension of the URL.
For instance, a URL such as "http://microsoft.com/HowToUseAzure.pdf" is likely to be a PDF. But some URLs don't look like that. I encountered one that looked like this:
http://www.sciencedirect.com/science?_ob=MImg&_imagekey=B6VMC-4286N5V-6-18&_cdi=6147&_orig=search&_coverDate=12%2F01%2F2000&_qd=1&_sk=999059994&wchp=dGLSzV-lSzBV&_acct=C000000152&_version=1&_userid=4429&md5=d4d53f46bdf6fb8c7431f4a2e04876e7&ie=f.pdf
I can do some intelligent parsing of this URL, and I end up with a first part:
http://www.sciencedirect.com/science
and the second part, which is the question mark and everything after it. In this case, the first part doesn't tell me what type of file I have, though the second part does have a clue. But the second part could be arbitrary. So my question is, what do I do in this situation? Can I download the file as 'binary' and then test the 'binary' bytes I'm getting to see if I have either
1) text 2) pdf 3) html?
If so, what is the test? What is the difference between 'binary' and 'pdf' and 'text' anyway - are there some byte values in a binary file that would simply not occur in an HTML file, or in a Unicode file, or in a PDF file?
Thanks.
How to detect if a file is in the PDF format?
Allow me to quote ISO 32000-1:
The first line of a PDF file shall be a header consisting of the 5 characters %PDF- followed by a version number of the form 1.N, where N is a digit between 0 and 7.
And ISO 32000-2:
The PDF file begins with the 5 characters “%PDF-” and offsets shall be calculated from the PERCENT SIGN (25h).
What's the difference? When you encounter a file that starts with %PDF-1.0 up to %PDF-1.7, you have an ISO 32000-1 file; starting with ISO 32000-2, a PDF file can also begin with %PDF-2.0.
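In code, that check is a simple comparison of the first bytes of the file. A minimal Python sketch (the function name and path handling are just examples):

    # Detect a PDF by its header, per ISO 32000.
    def looks_like_pdf(path):
        with open(path, "rb") as f:
            header = f.read(8)  # e.g. b"%PDF-1.7" or b"%PDF-2.0"
        return header.startswith(b"%PDF-")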
How to detect if a file is a binary file?
This is also explained in ISO 32000:
If a PDF file contains binary data, as most do, the header line shall be immediately followed by a comment line containing at least four binary characters, that is, characters whose codes are 128 or greater. This ensures proper behaviour of file transfer applications that inspect data near the beginning of a file to determine whether to treat the file's contents as text or as binary.
If you open a PDF in a text editor instead of in a PDF viewer, you'll often see that the second line looks like this:
%âãÏÓ
There is no such thing as a "plain text file"; a file always has an encoding. However, when people talk about plain text files, they often mean ASCII files: files in which every byte has a value lower than 128 (binary 10000000).
Back in the old days, transfer protocols often treated PDF documents as if they were ASCII files. Instead of sending 8-bit bytes, they only sent the first 7 bits of each byte (this is sometimes referred to as "byte shaving"). When this happens, the ASCII bytes of a PDF file are preserved, but all the binary content gets corrupted. When you open such a PDF in a PDF viewer, you see the pages of the PDF file, but every page is empty.
To avoid this problem, four non-ASCII characters are added in the PDF header. Transfer protocols check the first series of bytes, see that some of those bytes have a value higher than 127 (binary 01111111), and therefore treat the file as a binary file.
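You can simulate that byte shaving in a few lines to see why the marker works. A Python sketch; the sample bytes are the %PDF header followed by a typical binary marker line:

    # Simulate 7-bit "byte shaving". ASCII bytes survive unchanged; any
    # byte >= 128 is corrupted, which is what wrecked binary PDF content.
    data = bytes([0x25, 0x50, 0x44, 0x46, 0xE2, 0xE3, 0xCF, 0xD3])  # "%PDF" + binary marker
    shaved = bytes(b & 0x7F for b in data)
    assert shaved[:4] == data[:4]  # ASCII part preserved
    assert shaved[4:] != data[4:]  # binary part corrupted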
How to detect if a file is in the HTML format?
That's trickier, as HTML allows people to be sloppy. You'd expect the first non-whitespace character of an HTML file to be a < character, but such a file could also be a simple XML file that is not in the HTML format.
You'd expect <!doctype html>, <html> or <body> somewhere in the file (with or without attributes inside the tag), but some people create HTML files without mentioning the DocType, and even without an <html> or a <body> tag.
Note that HTML files can come in many different encodings. For instance: when they are encoded using UTF-8, they will contain bytes with a value higher than 127.
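So the best you can do in code is a heuristic. A Python sketch, assuming the first kilobyte is enough to look at; the marker list is deliberately incomplete:

    # Heuristic HTML detection. Decodes leniently (latin-1 never fails)
    # because HTML can come in many encodings.
    def looks_like_html(path):
        with open(path, "rb") as f:
            head = f.read(1024).decode("latin-1").lower()
        return any(m in head for m in ("<!doctype html", "<html", "<head", "<body"))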
How to detect if a file is an ASCII text file?
Just loop over all the bytes. If you find a byte with a value higher than 127, you have a file that is not in ASCII format.
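In Python, for instance, that loop is a one-liner (reading the whole file into memory, which is fine for a sketch):

    # A file is "ASCII" if every byte is below 128.
    def is_ascii_file(path):
        with open(path, "rb") as f:
            return all(byte < 128 for byte in f.read())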
What about files in Unicode?
In that case, there will usually be a Byte Order Mark (BOM) at the start of the file that allows you to detect the encoding.
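Since each BOM is a fixed byte sequence at the very start of the file, detection is again a prefix check. A Python sketch; note that the UTF-32 LE BOM must be tested before UTF-16 LE, because the latter is a prefix of the former:

    import codecs

    # Detect a Unicode encoding from its Byte Order Mark, if present.
    BOMS = [
        (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
        (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
        (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
        (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
        (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    ]

    def detect_bom(path):
        with open(path, "rb") as f:
            head = f.read(4)
        for bom, name in BOMS:
            if head.startswith(bom):
                return name
        return None  # no BOM; encoding unknown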
Are there other encodings?
Of course there are! See for instance ISO/IEC 8859. In many cases you can't tell which encoding a text file uses, because the encoding isn't stored as a property of the file.

Creating a CSV file with the Report Generation Toolkit in Labview

I want to create .csv files with the Report Generation Toolkit in Labview.
They must actually be .csv files which can be opened with Notepad or something similar.
Creating a .csv is not that hard; it's just a matter of adding the extension to the file name that's going to be created.
If I create a .csv file this way it opens nicely in Excel, just the way it should, but if I open it in Notepad it shows all kinds of characters and doesn't come close to the data I wrote to the file.
I create the files with the Labview code below:
Link to image (can't post the image yet because I've got too few points)
I know .csv files can be created with the Write to Spreadsheet VI, but I would like to use the Report Generation Toolkit because it makes it easy to add columns and rows to the file, and that is something I really need.
You can use the Robust CSV package on the lavag.org forum to read and write 2D arrays to CSV files.
http://lavag.org/files/file/239-robust-csv/
Calling a file "csv" does not make it a CSV file. I never used the toolkit to generate an Excel file, but I'm assuming it creates an XLS or XLSX file regardless of what extension you give it, which is why you're seeing gibberish (probably XLS, since it's been around for a while, and I believe XLSX is XML, not binary).
I'm not sure what your problem is with the write spreadsheet VI. It has an append input, so I assume you can use that to at least add rows directly to a file, although I can't say I ever tried it. I would prefer handling all the data in memory explicitly, where you can easily use the array functions to add rows or columns to the array and then overwrite the entire file.
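If you want to verify the XLS guess, legacy XLS files are OLE2 compound documents, which start with a fixed 8-byte signature. A Python sketch with a hypothetical file name:

    # Check whether a ".csv" file is actually a legacy XLS (OLE2
    # compound document) in disguise.
    OLE2_MAGIC = bytes([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1])

    with open("report.csv", "rb") as f:
        if f.read(8) == OLE2_MAGIC:
            print("This is an OLE2 container (e.g. .xls), not a text CSV.")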

How to extract hhp file from a chm file

I have an A.chm file for my Windows application, which runs as expected.
When I decompile it using HTML Help Workshop I get a set of HTML files, an .hhc file and an .hhk file. When I compile another file, B.chm, from these extracted files without changing any of them (I want to add more HTML content to this file, but it looks like I am losing some information by decompiling), the output file I get is 72K whereas the original file was 75K. B.chm's contents look all fine when viewed in the CHM viewer, but the behavior is lost when it is used with the application.
After reading around I found that if the .hhp file can be extracted from a .chm file, then the .chm can be reconstructed as it was, without losing any mappings or aliases. Is that true?
How can I extract the .hhp file from a .chm file?
Thanks,
Sam
No, yes, and no.
The original .hhp can't be guaranteed to be extractable.
However, since CHM is an archive type, the project could have added all project files to the archive. I assume you would already have found them if that were the case.
If the decompile process does its bookkeeping, it can regenerate the .hhp to a certain degree.
Comments and #define names will probably be lost, though, and maybe more, but that should not result in problems when recompiling.
But of course it could be that the decompiler is limited. You could try some other one (search for "KeyTools").
If not, then take "chmlib" and start drilling down into the format.
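As a first sanity check before drilling down: CHM files start with the ASCII signature 'ITSF'. A Python sketch; reading a version number right after the magic follows the unofficial format documentation, so treat that part as an assumption:

    import struct

    # Peek at the start of a CHM (ITSF) header.
    with open("A.chm", "rb") as f:
        magic = f.read(4)
        version = struct.unpack("<I", f.read(4))[0]

    if magic == b"ITSF":
        print("CHM file, ITSF version", version)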

MS Office no longer works as BLOB

Hi, does anyone know why MS Office files such as .doc, .docx and .xls can no longer be viewed when retrieved from a MySQL DB where they are stored as BLOBs?
The .doc and .docx files used to download and open without any problem, but now the file format is no longer recognised.
I'd like to ditto your problem. Images and plain text files upload to and download from a MySQL BLOB field fine; .doc and .docx files seem to be corrupted. I've read somewhere a rumor of MySQL truncating the last 4 bits, but I can't verify that.
I have used XVI32 (a hex editor) to compare local originals of files with versions downloaded from BLOB/LONGBLOB fields. It seems that extra bytes, which I think represent a CRLF, are appended, as far as I can work out by Windows when the file is written. This doesn't seem to be a problem for some graphic formats, which are to some extent fault-tolerant, but the Office XML format files are corrupted by this extra data.
I have tried using ob_clean() and ob_flush() [that is, in PHP] before printing/echoing the file contents, but the files are still corrupted as far as Office is concerned.
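If you still have a local original of one document, you can confirm (or rule out) the appended-CRLF theory by diffing the two files from the end. A Python sketch with placeholder file names:

    # Compare an original file with the BLOB-downloaded copy to see
    # whether bytes were appended (e.g. a stray CRLF from the output buffer).
    with open("original.docx", "rb") as f:
        a = f.read()
    with open("downloaded.docx", "rb") as f:
        b = f.read()

    if b.startswith(a):
        print("Appended bytes:", b[len(a):])  # e.g. b"\r\n"
    else:
        print("Files diverge earlier; size difference:", len(b) - len(a))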
I know this is an old thread but I would appreciate any solutions anyone might have found since it was last updated.
Did you try with a short .txt file instead of a .doc and see if the contents are different from what you expected?