Random access gzip stream - language-agnostic

I'd like to be able to do random access into a gzipped file.
I can afford to do some preprocessing on it (say, build some kind of index), provided that the result of the preprocessing is much smaller than the file itself.
Any advice?
My thoughts were:
Hack on an existing gzip implementation and serialize its decompressor state every, say, 1 megabyte of compressed data. Then to do random access, deserialize the decompressor state and read from the megabyte boundary. This seems hard, especially since I'm working with Java and I couldn't find a pure-java gzip implementation :(
Re-compress the file in chunks of 1 MB and do the same as above. This has the disadvantage of doubling the required disk space.
Write a simple parser of the gzip format that doesn't do any decompressing and only detects and indexes block boundaries (if there even are any blocks: I haven't yet read the gzip format description)

Have a look at this link (C code example).
/* zran.c -- example of zlib/gzip stream indexing and random access
...
Gzip is just zlib with an envelope.

The BGZF file format, compatible with GZIP, was developed by biologists.
(...) The advantage of BGZF over conventional gzip is that BGZF allows for seeking without having to scan through the entire file up to the position being sought.
In http://picard.svn.sourceforge.net/viewvc/picard/trunk/src/java/net/sf/samtools/util/ , have a look at BlockCompressedOutputStream and BlockCompressedInputStream.java
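A minimal sketch of how those classes are typically used (assuming the API is roughly as in the picard/samtools utilities linked above; the "virtual file pointer" encodes a compressed block address plus an offset within that block):
import java.io.File;
import java.nio.charset.StandardCharsets;
import net.sf.samtools.util.BlockCompressedInputStream;
import net.sf.samtools.util.BlockCompressedOutputStream;

public class BgzfDemo {
    public static void main(String[] args) throws Exception {
        // Write data as BGZF: a series of gzip-compatible blocks.
        try (BlockCompressedOutputStream out = new BlockCompressedOutputStream(new File("data.bgzf"))) {
            out.write("hello bgzf world".getBytes(StandardCharsets.UTF_8));
        }
        // Re-open and seek by virtual file pointer instead of scanning from the start.
        try (BlockCompressedInputStream in = new BlockCompressedInputStream(new File("data.bgzf"))) {
            long pointer = in.getFilePointer(); // remember a position...
            in.read();                          // ...read something...
            in.seek(pointer);                   // ...and jump straight back to it
        }
    }
}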

FWIW: I've developed a command-line tool on top of zlib's zran.c source code that provides random access into gzip files by creating indexes for them: https://github.com/circulosmeos/gztool
It can even create an index for a still-growing gzip file (for example a log written by rsyslog directly in gzip format), which in practice reduces the index-creation time to zero. See the -S (Supervise) option.

Interesting question. I don't understand why your 2nd option (re-compressing the file in chunks) would double the disk space. It seems to me it would be about the same, apart from a small amount of overhead. If you have control over the compression piece, then that seems like the right idea.
Maybe what you mean is that you don't have control over the input, and therefore it would double.
If you can do it, I'm imagining modelling it as a CompressedFileStream class that uses as its backing store, a series of 1mb gzip'd blobs. When reading, a Seek() on the stream would move to the appropriate blob and decompress. A Read() past the end of a blob would cause the stream to open the next blob.
ps: GZIP is described in IETF RFC 1952, but it uses DEFLATE for the compression format. There'd be no reason to use the GZIP elaboration if you implemented this CompressedFileStream class as I've imagined it.
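For illustration, a rough Java sketch of that CompressedFileStream idea, assuming the file has been re-compressed as independent raw-DEFLATE blocks of 1 MB of uncompressed data each, and that a small index of (compressedOffset, compressedLength) per block was recorded at compression time. All class and field names here are made up, and reads that cross a block boundary would need to continue into the next blob.
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public class CompressedFileStream {
    private static final int BLOCK_SIZE = 1 << 20; // 1 MB of uncompressed data per block

    public static class BlockEntry {
        final long compressedOffset;
        final int compressedLength;
        public BlockEntry(long off, int len) { compressedOffset = off; compressedLength = len; }
    }

    private final RandomAccessFile file;
    private final List<BlockEntry> index;

    public CompressedFileStream(RandomAccessFile file, List<BlockEntry> index) {
        this.file = file;
        this.index = index;
    }

    // Reads up to `len` bytes starting at an arbitrary uncompressed position,
    // limited here to what remains of the containing block.
    public byte[] read(long uncompressedPos, int len) throws IOException, DataFormatException {
        int blockNo = (int) (uncompressedPos / BLOCK_SIZE);
        int offsetInBlock = (int) (uncompressedPos % BLOCK_SIZE);

        BlockEntry e = index.get(blockNo);
        byte[] compressed = new byte[e.compressedLength];
        file.seek(e.compressedOffset);          // jump straight to the blob
        file.readFully(compressed);

        Inflater inflater = new Inflater(true); // raw deflate, no per-block gzip header
        inflater.setInput(compressed);
        byte[] block = new byte[BLOCK_SIZE];
        int produced = inflater.inflate(block); // decompress just this one block
        inflater.end();

        int n = Math.max(Math.min(len, produced - offsetInBlock), 0);
        byte[] out = new byte[n];
        System.arraycopy(block, offsetInBlock, out, 0, n);
        return out;
    }
}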


What are the pros and cons of Base64 file upload through JSON, as opposed to AJAX or jQuery upload?

I was tasked with writing image upload to a remote server and saving those images locally. It was quite easy to do with Base64 transfer through JSON and storage with Node.js. However, is there a reason not to use this type of file upload and to use AJAX or other ways instead? (Other than the ~30% bandwidth increase, which I know about. You can still include that in your answer so that it is complete.)
The idea of Base64 encoding is to avoid binary data in protocols based on text. Outside that situation, I think it's always a bad idea.
Pros
Avoidance of binary data in protocols based on text, and independence from external files.
Avoidance of delimiter collision.
Cons
Increased time and space complexity; for space it's 33–36% (33% from the encoding itself, up to 3% more from inserted line breaks).
API response payloads are larger/too large.
User experience is negatively impacted, unless one invokes some lazy loading.
By including all image data together in one API response, the app must receive all data before drawing anything on screen. This means users will see on-screen loading states for longer and the app will appear sluggish as users wait.
This is, however, mitigated with Axios and a lazy loader such as react-lazyload or lazyload.
CDN caching is harder. Contrary to image files, the Base64 strings inside an API response cannot be delivered via a CDN cache. The whole API response must be delivered by CDN. (cf., Don’t use Base64 encoded images on mobile and Why "optimizing" your images with Base64 is almost always a bad idea)
Image caching on the device is no longer possible.
Content management becomes harder on the server side. Most content management tools handle images as binary files, so managing Base64 data adds the time overhead of encoding/decoding.
No security gain, and engineering overhead to mitigate the usual risks (sanitizing, input validation, escaping). Example of an XSS attack: Preventing XSS with Base64 encoding: The False sense of web application security
The developers of that site might have opted to make the website appear more secure by having cryptic URLs and whatnot. However, that doesn't mean this is security by obscurity. If their website is vulnerable to SQL injection and they try to hide that by encoding the URLs, then it's security by obscurity. If their website is well secured against SQL injection, XSS, CSRF, etc., and they decided to encode the URLs like that, then it's just plain stupidity.
It does not help with text-encoded images such as SVG (Probably Don’t Base64 SVG).
Data URIs aren't supported on IE6 or IE7, nor on Opera before 7.2 (Which browsers support data URIs and since which version?)
References
https://en.wikipedia.org/wiki/Base64
https://en.wikipedia.org/wiki/Delimiter#Delimiter_collision
SO: What is base 64 encoding used for?
https://medium.com/snapp-mobile/dont-use-base64-encoded-images-on-mobile-13ddeac89d7c
https://css-tricks.com/probably-dont-base64-svg/
https://security.stackexchange.com/questions/46362/purpose-of-using-base64-encoded-urls
https://bunnycdn.com/blog/why-optimizing-your-images-with-base64-is-almost-always-a-bad-idea/
https://www.davidbcalhoun.com/2011/when-to-base64-encode-images-and-when-not-to/
Data Encoding
Every encoding and decoding scheme is used for different reasons, and each comes with benefits and downsides. For example:
Error-detection encodings: can detect errors but increase data usage.
Encryption: turns data into ciphertext that an intruder cannot decipher.
There are many encoding algorithms that alter data in ways that are useful for some purpose.
Base64 encoding maps every 6 bits of input to one 8-bit character, so 3 bytes become 4 bytes, and the output alphabet is only alphanumeric characters (62 distinct) plus 2 symbols.
Its benefit is that the output contains no special characters or control characters.
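For instance, in Java (java.util.Base64 from the standard library), three input bytes become exactly four output characters:
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64Groups {
    public static void main(String[] args) {
        byte[] threeBytes = "Man".getBytes(StandardCharsets.US_ASCII); // 3 bytes: 0x4D 0x61 0x6E
        String encoded = Base64.getEncoder().encodeToString(threeBytes);
        System.out.println(encoded);          // TWFu -- 4 characters from 3 bytes
        System.out.println(encoded.length()); // 4
    }
}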
Base64 Purpose
It makes it possible to transfer any data over channels that prohibit:
special characters like ' " / \ ...
non-printable ASCII like \0 \n \r \t \a
8-bit ASCII codes (ASCII with the MSB set)
Binary files usually contain arbitrary data which, interpreted as ASCII, can be any 8-bit character. Some protocols and applications have I/O interfaces that accept only a handful of characters (alphanumeric plus a few symbols), because they want to:
prevent code injection (e.g. SQL injection, or any characters that look like programming-language syntax, such as ;),
reserve characters that already have a meaning in the protocol (e.g. in a URI query string the character & has a meaning and cannot appear in a query-string value),
or restrict input that is not meant to contain non-alphanumeric values (e.g. a field that should accept only human names).
With Base64 encoding you can encode anything and transfer it over any such channel.
Examples:
You can encode an image or an executable and store it in a DBMS with SQL.
You can include binary data in a URI.
You can send binary files over a protocol designed to carry only human chat as alphanumeric text, such as an IRC channel.
Base64 is just a conversion format: an HTTP server cannot accept binary data in the body unless the HTTP headers declare a binary (or another acceptable) content type defined by the web server.
As you might know, JSON can carry various formats and information; for example, you can send something like:
{
  "IMG_FILENAME": "HELLO",
  "IMG_TYPE": "IMG/JPEG",
  "DATA": "~~~BASE64 ENCODED IMAGE~~~~"
}
You can send the JSON through AJAX or another method. But, as I said, the HTTP server has various limitations because it must follow RFC 2616 (https://www.rfc-editor.org/rfc/rfc2616).
In short, sending through JSON lets you bundle various kinds of data together.
AJAX is just one way of sending it, like the others.
I used the same solution in one of my projects.
The only concern is the request body size. If all your images are small, like a few MB, then you should be fine.
My server is ASP.NET Core; its maxAllowedContentLength value is 30000000, which is approximately 28.6 MB. When the image size is over this, the request fails with the error "request body too large".
I think Node.js should have a similar setting; make sure to adjust it to meet your needs.
Please note that when the request size is too big, the possibility of a request timeout increases accordingly due to the network traffic. This will be an issue especially for requests from phones.
I think the use of base64 is valid.
The only concern is the size of the request, but this can be worked around by splitting the Base64 string on the frontend: for a 30 MB file you could send 5 MB per request and reassemble the parts on the backend. This is useful even for resuming the upload when a network problem corrupts some part.
Hugs
Base64 converts your data to an ASCII representation of the binary data. It allows you to embed your data in text streams such as JSON for example. Base64 increases the size of the data transferred by 33%.
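As a quick sanity check of that 33% figure, here is a small Java snippet (the buffer size is chosen arbitrarily as a stand-in for an image):
import java.util.Base64;
import java.util.Random;

public class Base64Overhead {
    public static void main(String[] args) {
        byte[] image = new byte[3_000_000];  // stand-in for a ~3 MB binary image
        new Random(42).nextBytes(image);
        String encoded = Base64.getEncoder().encodeToString(image);
        System.out.println("binary bytes : " + image.length);     // 3000000
        System.out.println("base64 chars : " + encoded.length()); // 4000000, i.e. ~33% larger
    }
}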
multipart/form-data is the standard way of transferring binary data in HTTP requests. It allows you to use specific encodings / content types for each part you'd like to transfer. In my opinion, you should stick to multipart uploads unless you have specific requirements or device/SDK capabilities.
Check out these links:
What is difference Between Base64 and Multipart?
Base64 image upload VS Binary image upload?

How to detect type of compression (if no header is provided)

I have a blob of binary data (network capture) that is parsed by a binary on my machine. I am assuming that because the binary expects a type of data, no header information indicating the type of compression is necessary as that would be wasted bandwidth. How then, if given an arbitrary amount of binary data, can I determine the method of compression? Also how do I go about decompressing?
PEiD plugin "Kanal" tells me the binary has "BZIP2 [long]" and "ZLIB deflate [long]" features in it, but what program can I use to say "treat this arbitrary data like it's bzip2, even though there is no header/magic number, and see what the decompression result is", where "bzip2" can be replaced with any compression method? Is this possible?
edit: this is similar to: How to detect type of compression used on the file? (if no file extension is specified) only this time, no header info is specified.
Thanks!
Just start decompressing. zlib will detect very quickly if it is not deflate data being fed to it. I don't know how quickly libbzip2 will figure that out, but if you have only those two choices then just try zlib first.
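In Java, the same trick works with java.util.zip.Inflater: feed it the raw bytes and see whether it throws. A minimal sketch (heuristic only; data that merely starts out looking like DEFLATE will not be caught this early):
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public class DeflateProbe {
    // Returns true if the start of `data` inflates cleanly as a raw DEFLATE stream (no zlib/gzip header).
    static boolean looksLikeRawDeflate(byte[] data) {
        Inflater inflater = new Inflater(true); // true = expect headerless ("raw") deflate
        inflater.setInput(data);
        byte[] out = new byte[8192];
        try {
            inflater.inflate(out); // throws DataFormatException quickly on non-deflate input
            return true;
        } catch (DataFormatException e) {
            return false;
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] blob = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(looksLikeRawDeflate(blob) ? "could be deflate" : "not deflate");
    }
}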

How to handle loading of LARGE JSON files

I've been working on a WebGL application that requires a tremendous amount of point data to draw to the screen. Currently, that point and surface data is stored on the webserver I am using, and ALL 120MB of JSON is being downloaded by the browser on page load. This takes well over a minute on my network, which is not optimal. I was wondering if anyone has any experience/tips about loading data this large. I've tried eliminating as much whitespace as possible, but that barely made a dent in file size.
Is there any way to either compress this file immensely, or otherwise a better way to download such a large amount of data? Any help would be great!
JSON is incredibly redundant, so it compresses well; compress it on the server and decompress it on the client.
JavaScript implementation of Gzip
Alternatively, you could split the data into 1 MB chunks and send them one at a time.
Also the user probably can't interact with 120 MB of data at a time, so maybe implement some sort of level of detail system?
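If you want a rough idea of how much a redundant JSON payload shrinks, here is a small Java check using the standard GZIPOutputStream (the payload shape here is invented for the example):
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class GzipJsonDemo {
    public static void main(String[] args) throws Exception {
        // Build a highly redundant JSON-like payload: repeated keys, similar in spirit to large point data.
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < 100_000; i++) {
            sb.append("{\"x\":").append(i).append(",\"y\":").append(i * 2).append(",\"z\":0},");
        }
        sb.setLength(sb.length() - 1);
        sb.append("]");
        byte[] json = sb.toString().getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(compressed)) {
            gz.write(json);
        }
        System.out.printf("raw: %d bytes, gzipped: %d bytes%n", json.length, compressed.size());
    }
}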
If you control the web server sending the data, you could try enabling compression of json data.
This is done by adding the following in applicationhost.config (IIS 7):
<system.webServer>
  <urlCompression doDynamicCompression="true" />
  <httpCompression>
    <dynamicTypes>
      <add mimeType="application/json" enabled="true" />
      <add mimeType="application/json; charset=utf-8" enabled="true" />
    </dynamicTypes>
  </httpCompression>
</system.webServer>
You would then need to restart the App pool for your app.
Source
A few things you might consider:
Is your server compressing the file before sending?
Does this data change often? If it does not, you could set your expires header to a very long time, so the browser could keep it in cache. It wouldn't help on the first page access, but on subsequent ones the file wouldn't have to be loaded again.
Is there a lot of repeating stuff in your json file? For instance, if your object keys are long, you could replace them with shorter ones, send, and replace again in the browser. The benefits will not be that great if the file is compressed (see item 1) but depending on your file it might help a little.
Is all this data consumed by the browser at once? If it's not, you could try breaking it down into smaller pieces, and start processing the first parts while the others load.
But the most important: are you sure JSON is the right tool for this job? A general purpose compression tool can only go so far, but if you explore the particular characteristics of your data you might be able to achieve better results. If you give more details on the format you're using we may be able to help you more.
I had an issue somewhat like yours, so I decided to use binary flags instead of strings: when the user sends a request to my server, I answer with numbers.
For example, let's say the user is a table in a restaurant. Instead of sending a string like 'burger, orange juice, water, etc.' I can send the number 15 and decode it as the bits 8, 4, 2, 1.
When the user asks for multiples of the same thing, say 4 burgers, plain flags are hard to follow, so you can pair each flag with a count in an array.
I found it very useful and more secure.
If you decide to do it, I suggest using strings in development mode and translating to the binary form when you deploy.
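A small Java sketch of the flag idea (the item names and bit assignments are of course made up):
public class OrderFlags {
    // Each menu item gets one bit.
    static final int BURGER       = 1; // 0b0001
    static final int ORANGE_JUICE = 2; // 0b0010
    static final int WATER        = 4; // 0b0100
    static final int FRIES        = 8; // 0b1000

    public static void main(String[] args) {
        int order = BURGER | ORANGE_JUICE | WATER | FRIES; // 15, instead of "burger,orange juice,water,fries"

        // The receiver decodes by testing each bit.
        if ((order & BURGER) != 0)       System.out.println("burger");
        if ((order & ORANGE_JUICE) != 0) System.out.println("orange juice");
        if ((order & WATER) != 0)        System.out.println("water");
        if ((order & FRIES) != 0)        System.out.println("fries");

        // For quantities (e.g. 4 burgers), pair each flag with a count: [flag, count, flag, count, ...]
        int[] withCounts = { BURGER, 4, WATER, 2 };
        System.out.println(withCounts.length / 2 + " distinct items with counts");
    }
}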

Perfmon .blg file specification / parsing library

Where can I find a detailed, low-level spec for the Perfmon binary .blg file format? Or even better, has anyone written a low level, open source library (preferably in C, but any language would do) for parsing .blg files?
There's a tool called relog that can convert these files to csv or other formats.
http://blog.bennett-scharf.com/2008/12/17/converting-an-existing-perfmon-blg-file-to-csv/
This won't help for looking at historical data, but if you have access to the systems running Perfmon, you may want to look at Logman. With Logman you can set performance counters AND specify the output format, that way you can just chose a format that is easy to parse. See the -f option:
-f { bin | bincirc | csv | tsv | SQL } : Specifies the file format used for collecting performance counter and trace data. You can use binary, circular binary, comma and tab separated, or SQL database formats when collecting performance counters.
As others have said if you also have historical records you need to parse you can use the Relog utility to convert existing .blg files in to a more useful format.
Another option is to export the perfmon Data Collection Set as a template, and change the log file format in the XML - look for the LogFileFormat tag and change the value to the format of your preference
0 = CSV, 1 = TSV, 2 = SQL, 3 = the default binary format.
I was looking for a way to incorporate PerfMon data into a SIEM, and found that getting perfmon to log to a SQL DB (and reading the data from a SQL view, from the SIEM agent) was the best way of doing this.
I can't say much about other products, but in LogRhythm SIEM, you need a "UDLA" (universal database log adapter) log source for it - and if you want to parse/contextualise the metadata, you'll need some parsing rules (ie regex) for what the query returns.
It's useful to see things like "if there's x number of logon errors, AND Avail MBytes is less than 100, THEN trigger alarm/AIEngine rule 'Insufficient Memory to Process Logons'".
That's a pretty lame example, but you get the idea.
You might also look at other things which have a potentially malicious explanation, and also a benign explanation.
For example - if you see a large amount of failed attempts to reset passwords, this might usually indicate some malicious behaviour - but not if you see the perfmon counters telling you that the Domain Controller has a total of less than 1,000 free system PTEs (admittedly unlikely on a 64-bit OS), or is seeing CPU usage of more than 95%. In which case, it's not necessarily a security issue, it's a load/capacity issue - or something is very wrong with your DC.

Reverse engineering a custom data file

At my place of work we have a legacy document management system that for various reasons is now unsupported by the developers. I have been asked to look into extracting the documents contained in this system to eventually be imported into a new 3rd party system.
From tracing and process monitoring I have determined that the document images (mainly tiff files) are stored in a number of 1.5GB files. These files seem to be read from a specific offset and then written to a tmp file that is then served via a web app to the client, and then deleted.
I guess I am looking for suggestions as to how I can inspect these large files that contain the tiff images, and eventually extract and write them to individual files.
Are the TIFFs compressed in some way? If not, then your job may be pretty easy: stitch the TIFFs together from the 1.5G files.
Can you see the output of a particular 1.5G file (or series of them)? If so, then you should be able to piece together what the bytes should look like for that TIFF if it were uncompressed.
If the bytes don't appear to be there, then try some standard compressions (zip, tar, etc.) to see if you get a match.
I'd open a file, seek to the required offset, and then stream into a tiff object (ideally one that supports streaming from memory or file). Then you've got it. Poke around at some of the other bits, as there's likely metadata about the document that may be useful to the next system.
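As a starting point, here is a rough Java scan for TIFF signatures ("II*\0" little-endian or "MM\0*" big-endian) in one of the 1.5 GB container files. It assumes the images are stored uncompressed; it only reports candidate offsets, and for brevity it misses signatures that straddle a buffer boundary. Working out each image's length (from the next signature, or from metadata elsewhere in the container) is the remaining step.
import java.io.IOException;
import java.io.RandomAccessFile;

public class TiffScanner {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile f = new RandomAccessFile(args[0], "r")) {
            byte[] buf = new byte[1 << 20]; // scan 1 MB at a time
            long base = 0;
            int n;
            while ((n = f.read(buf)) > 0) {
                for (int i = 0; i + 3 < n; i++) {
                    boolean littleEndian = buf[i] == 'I' && buf[i + 1] == 'I' && buf[i + 2] == 0x2A && buf[i + 3] == 0;
                    boolean bigEndian    = buf[i] == 'M' && buf[i + 1] == 'M' && buf[i + 2] == 0 && buf[i + 3] == 0x2A;
                    if (littleEndian || bigEndian) {
                        System.out.println("possible TIFF header at offset " + (base + i));
                    }
                }
                base += n;
            }
        }
    }
}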