Create zip with a precise fixed size - language-agnostic

I'd like to create a zip file with a specific size, e.g. 10,000,000 bytes.
Let's say I have a number of files to zip, and they add up to 8,123,456 bytes (an arbitrary number) once zipped.
I'd like to add "rubbish" to reach exactly the size I chose (10 MB in this case).
Of course I'm assuming the files I have to zip do not exceed my limit.
I don't have any specific language requirement to do that.

Just concatenate a "rubbish" file and a valid Zip file that together add up to your 10 MB.
The Zip format was made for such situations. Typically, you concatenate an executable and a Zip file to make an installer or a self-extracting archive.
Here is an example (I concatenated a randomly chosen GIF and a Zip file). The Info-ZIP unzip tool says:
unzip -v appended.zip
Archive: appended.zip
warning [appended.zip]: 1772887 extra bytes at beginning or within zipfile
(attempting to process anyway)
Length Method Size Ratio Date Time CRC-32 Name
-------- ------ ------- ----- ---- ---- ------ ----
186 Defl:N 140 25% 15.02.12 12:03 56202537 zip-ada/debg_za.cmd
349 Defl:N 202 42% 10.02.12 22:19 7718ccec zip-ada/debug.pra
4357 Defl:N 1381 68% 24.09.16 06:43 30f2fef0 zip-ada/demo/demo_csv_into_zip.adb
1015 Defl:N 513 50% 02.10.18 15:49 f0edcf97 zip-ada/demo/demo_unzip.adb
603 Defl:N 310 49% 20.03.16 08:26 b3906614 zip-ada/demo/demo_zip.adb
161483 Defl:N 39845 75% 27.08.16 16:35 9f24d1fe zip-ada/doc/appnote.txt
...
The decompression test is successful (as expected):
unzip -t appended.zip
Archive: appended.zip
warning [appended.zip]: 1772887 extra bytes at beginning or within zipfile
(attempting to process anyway)
testing: zip-ada/debg_za.cmd OK
testing: zip-ada/debug.pra OK
testing: zip-ada/demo/demo_csv_into_zip.adb OK
testing: zip-ada/demo/demo_unzip.adb OK
testing: zip-ada/demo/demo_zip.adb OK
testing: zip-ada/doc/appnote.txt OK
testing: zip-ada/doc/lzma-specification.txt OK
...
No errors detected in compressed data of appended.zip.
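For example, a minimal sketch in Python (file names and the 10,000,000-byte target are placeholders):

TARGET = 10_000_000

with open("archive.zip", "rb") as f:
    zip_data = f.read()

padding = TARGET - len(zip_data)     # number of "rubbish" bytes to prepend
assert padding >= 0, "the archive is already larger than the target"

with open("padded.zip", "wb") as f:
    f.write(b"\0" * padding)         # any rubbish will do
    f.write(zip_data)                # readers locate the central directory at the end

As in the listing above, unzip will warn about the extra bytes at the beginning but still extracts everything.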

Wrap the zipped data in another file formatted as:
<zip file length><zip file content><random data until size is as required>
You know how many random bytes to add, because it is the required size minus the length of the zip file minus the size of the length field.
To unzip, just reverse the process.
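A rough sketch of that wrapper in Python, assuming an 8-byte little-endian length field and placeholder file names:

import os, struct

TARGET = 10_000_000

def wrap(zip_path, out_path):
    zip_data = open(zip_path, "rb").read()
    padding = TARGET - len(zip_data) - 8              # 8 = size of the length field
    with open(out_path, "wb") as out:
        out.write(struct.pack("<Q", len(zip_data)))   # zip file length
        out.write(zip_data)                           # zip file content
        out.write(os.urandom(padding))                # random data up to the target size

def unwrap(in_path, zip_path):
    with open(in_path, "rb") as f:
        (length,) = struct.unpack("<Q", f.read(8))
        open(zip_path, "wb").write(f.read(length))

Note that the wrapped file is no longer something standard zip tools can open directly; you have to unwrap it first.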

You can add your "rubbish" file without compression (stored) inside your target Zip file.
The added size will be the "rubbish" file's size, plus 76 bytes, plus twice the "rubbish" file's name's length. Some zipping tools add extra metadata on top of that.
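A rough sketch with Python's zipfile module (the member name and target size are placeholders, and as noted above the exact byte count can still vary slightly between tools):

import os, zipfile

TARGET = 10_000_000
NAME = "rubbish.bin"

current = os.path.getsize("archive.zip")
# 76 bytes = 30-byte local header + 46-byte central directory entry;
# the member name is written in both of them.
rubbish_size = TARGET - current - 76 - 2 * len(NAME)

with zipfile.ZipFile("archive.zip", "a") as zf:
    zf.writestr(zipfile.ZipInfo(NAME), b"\0" * rubbish_size, zipfile.ZIP_STORED)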

Related

Maximum size of xattr in OS X

I would like to use xattr to store some metadata directly on my files. These are essentially tags that I use to categorize files when I search them. My goal is to extend the usual Mac OS X tags by associating more info with each tag, for instance the date the tag was added, and maybe other things.
I was thinking of adding an xattr to the files, using xattr -w. My first guess would be to store something like a JSON string as the xattr value, but I was wondering:
1) what are the size limits of what I can store in an xattr? (the man page for xattr is vague and refers to something called _PC_XATTR_SIZE_BITS which I cannot locate anywhere)
2) is there anything wrong with storing a JSON-formatted string as an xattr?
According to man pathconf, there is a “configurable system limit or option variable” called _PC_XATTR_SIZE_BITS which is
the number of bits used to store maximum extended attribute size in bytes. For
example, if the maximum attribute size supported by a file system is 128K, the value
returned will be 18. However a value 18 can mean that the maximum attribute size can be
anywhere from (256KB - 1) to 128KB. As a special case, the resource fork can have much
larger size, and some file system specific extended attributes can have smaller and preset
size; for example, Finder Info is always 32 bytes.
You can determine the value of this parameter using this small command line tool written in Swift 4:
import Foundation

let args = CommandLine.arguments.dropFirst()
guard let pathArg = args.first else {
    print("File path argument missing!")
    exit(EXIT_FAILURE)
}
let v = pathconf(pathArg, _PC_XATTR_SIZE_BITS)
print("_PC_XATTR_SIZE_BITS: \(v)")
exit(EXIT_SUCCESS)
I get:
31 bits for HFS+ on OS X 10.11
64 bits for APFS on macOS 10.13
as the number of bits used to store maximum extended attribute size. These imply that the actual maximum xattr sizes are somewhere in the ranges
1 GiB ≤ maximum < 2 GiB for HFS+ on OS X 10.11
8 EiB ≤ maximum < 16 EiB for APFS on macOS 10.13
I seem to be able to write at least 260 kB, like this, by generating 260 kB of nulls and converting them to the letter a so I can see them:
xattr -w myattr "$(dd if=/dev/zero bs=260000 count=1|tr '\0' a)" fred
1+0 records in
1+0 records out
260000 bytes transferred in 0.010303 secs (25235318 bytes/sec)
And then read them back with:
xattr -l fred
myattr: aaaaaaaaaaaaaaaaaa...aaa
And check the length returned:
xattr -l fred | wc -c
260009
I suspect this is actually a limit of ARG_MAX on the command line:
sysctl kern.argmax
kern.argmax: 262144
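As a side note, a quick way to test values beyond ARG_MAX is to call setxattr(2) directly instead of going through the shell; for example, a rough Python/ctypes sketch (the file and attribute names are placeholders):

import ctypes, ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
# macOS signature: setxattr(path, name, value, size, position, options)
libc.setxattr.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                          ctypes.c_size_t, ctypes.c_uint32, ctypes.c_int]

def set_xattr(path, name, value):
    rc = libc.setxattr(path.encode(), name.encode(), value, len(value), 0, 0)
    if rc != 0:
        raise OSError(ctypes.get_errno(), "setxattr failed")

set_xattr("fred", "myattr", b"a" * 1_000_000)   # well beyond kern.argmax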
Also, just because you can store 260kB in an xattr, that does not mean it is advisable. I don't know about HFS+, but on some Unixy filesystems, the attributes can be stored directly in the inode, but if you go over a certain limit, additional space has to be allocated on disk for the data.
With the advent of High Sierra and APFS to replace HFS+, be sure to test on both filesystems - also make sure that Time Machine backs up and restores the data as well and that utilities such as ditto, tar and the Finder propagate them when copying/moving/archiving files.
Also consider what happens when you email a tagged file, or copy it to a FAT-formatted USB memory stick.
I also tried setting multiple attributes on a single file, and the following script successfully wrote 1,000 attributes (called attr-1, attr-2 ... attr-1000), each of 260 kB, to a single file - meaning that the file effectively carries 260 MB of attributes:
#!/bin/bash
for ((a=1; a<=1000; a++)); do
    echo "Setting attr-$a"
    xattr -w attr-$a "$(dd if=/dev/zero bs=260000 count=1 2> /dev/null | tr '\0' a)" fred
    if [ $? -ne 0 ]; then
        echo "ERROR: Failed to set attr"
        exit 1
    fi
done
These can all be seen and read back too - I checked.

What is the max number of files to select in an HTML5 [multiple] file input?

I have 64,000 small images I want to upload to my website (using existing validation, so no FTP etc.). I created an HTML5 [multiple] type=file input for this a while back, to be used for a hundred or a few hundred images. Hundreds is not a problem. The images are batched and sent to the server.
But when I select a folder of ~ 16000 images, the file input's FileList is empty... The onchange event triggers, but the file list is empty. The browser (or file system or OS?) seems to have a problem selecting this many files.
I've created a very small tool to help determine what could be the max: http://jsfiddle.net/rudiedirkx/Ehhk5/1/show/
$inp.onchange = function(e) {
    var l = 0, b = 0;
    for (var i = 0, F = this.files, L = F.length; i < L; i++) {
        l += F[i].name.length;
        b += F[i].size;
    }
    $nf.innerHTML += this.files.length + ' files: ' + (b / 1000 / 1000) + ' MB / ' + l + ' chars of filename<br>';
};
All it does is count:
the number of files
the number of characters of all file names combined
the total file size in MB
When I try this, the very most I get is:
1272 files: 176.053987 MB / 31469 chars of filename
(On 32 & 64 bit Win7, Chrome 26-52)
The next image (which fails) would be:
1273 images, which is not an obvious cut-off
a file size between 176 and 177 MB, also not an obvious cut-off
fewer than 32,000 chars of filenames, also not an obvious cut-off, although it sort of maybe looks like 32k...
In my calc, 1 MB = 1000^2 Bytes, not 1024^2. (That would be a MiB, but maybe my OS/filesystem/browser disagrees.)
My question would be: why this many files? Why this max? Is it OS-dependent or browser-dependent? Where do I find the specs for that? Is it JS's fault? Searching for "file input max files" and the like only turns up the [max] attribute, which is irrelevant.
More test results:
In Firefox the max seems to be much higher. At least "2343 files: 310.66553999999996 MB / 60748 chars of filename" (that's all the files I have right here)
In Firefox also: "16686 files: 55.144415 MB / 146224 chars of filename" (much smaller, but more files)
Update
Chrome 52 canary on Windows is still limited to 32k chars of file names
Firefox (44+) on Windows is still unlimited
why this many files?
The number of files you can select depends on the combined number of characters of all the file names.
Why this max?
In the Windows API, the maximum path length limitation is 260 characters (MAX_PATH); the Unicode versions of the APIs allow up to 32,767 characters.
Chrome simply uses the maximum path length of the Unicode API, so the limit is about 32k characters, as you observed.
Check this fix: https://code.google.com/p/chromium/issues/detail?id=44068
Firefox dynamically allocates a buffer big enough to hold the names of all the selected files, so it can handle a much larger total path length.
Check this fix: https://bugzilla.mozilla.org/show_bug.cgi?id=660833
Is it OS dependent or browser dependent?
Both.
Where do I find the specs for that?
For Windows API usage and reference:
http://msdn.microsoft.com/en-us/library/aa365247.aspx (Maximum Path Length Limitation)
http://msdn.microsoft.com/en-us/library/ms646839(VS.85).aspx
Is it JS' fault?
No.

Searching through very large rainbow table file

I am looking for the best way to search through a very large rainbow table file (13GB file). It is a CSV-style file, looking something like this:
1f129c42de5e4f043cbd88ff6360486f; somestring
78f640ec8bf82c0f9264c277eb714bcf; anotherstring
4ed312643e945ec4a5a1a18a7ccd6a70; yetanotherstring
... you get the idea - there are about 900 million lines, each with a hash, a semicolon, and a clear-text string.
So basically, the program should check whether a specific hash is listed in this file.
What's the fastest way to do this?
Obviously, I can't read the entire file into memory and then run strstr() on it.
So what's the most efficient way to do it?
read the file line by line, calling strstr() each time;
read a larger chunk of the file (e.g. 10,000 lines) and run strstr() on that
Or would it be more efficient to import all this data into a MySQL database and then search for the hash via SQL queries?
Any help is appreciated
The best way to do it would be to sort it and then use a binary search-like algorithm on it. After sorting it, it will take around O(log n) time to find a particular entry where n is the number of entries you have. Your algorithm might look like this:
1. Keep a start offset and an end offset. Initialize the start offset to zero and the end offset to the file size.
2. If start = end, there is no match.
3. Read some data from the offset (start + end) / 2.
4. Skip forward until you see a newline. (You may need to read more, but if you pick an appropriate size (bigger than most of your records) to read in step 3, you probably won't have to read any more.)
5. If the hash you're on is the hash you're looking for, go on to step 8.
6. Otherwise, if the hash you're on is less than the hash you're looking for, set start to the current position and go to step 2.
7. If the hash you're on is greater than the hash you're looking for, set end to the current position and go to step 2.
8. Skip to the semicolon and trailing space. The unhashed data runs from the current position to the next newline.
This can be easily converted into a while loop with breaks.
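For example, a rough Python sketch of those steps (assuming the file has already been sorted by hash and that every line looks like "<hex hash>; <string>"):

def lookup(path, target):
    with open(path, "rb") as f:
        f.seek(0, 2)                        # seek to the end to learn the file size
        start, end = 0, f.tell()
        while start < end:
            mid = (start + end) // 2
            f.seek(max(mid - 1, 0))
            if mid > 0:
                f.readline()                # step to the start of the next full line
            pos = f.tell()
            line = f.readline()
            if not line or pos >= end:      # no complete line left in [mid, end): go lower
                end = mid
                continue
            hash_part, _, rest = line.partition(b";")
            if hash_part == target:
                return rest.strip()         # the clear-text string
            if hash_part < target:          # hex strings compare like the hash values
                start = pos + len(line)
            else:
                end = mid
    return None

# e.g. lookup("rainbow.txt", b"1f129c42de5e4f043cbd88ff6360486f")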
Importing it into MySQL with appropriate indexes would use a similarly efficient algorithm (or an even better one, since the data is probably packed nicely).
Your last solution might be the easiest one to implement, as you move the whole performance optimization to the database (and databases are usually optimized for exactly that).
strstr is not useful here, as it searches for an arbitrary substring, but you know the specific format and can jump and compare in a more targeted way. Think about strncmp and strchr.
The overhead of reading a single line at a time would be really high (as is often the case with file I/O). So I'd recommend reading a larger chunk and performing your search on that chunk. I'd even think about parallelizing the search by reading the next chunk in another thread and doing the comparison there as well.
You can also think about using memory-mapped I/O instead of the standard C file API. That way you leave loading the contents to the operating system and don't have to take care of caching yourself.
Of course, restructuring the data for faster access would also help. For example, insert padding bytes so all records are equally long. This gives you "random" access to your data stream, because you can easily calculate the position of the nth entry.
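For example, a rough sketch combining memory-mapped I/O with fixed-length (padded) records, assuming the file is also sorted by hash; the record size and file name are just placeholders:

import mmap

RECORD_LEN = 64                                  # assumed padded record size

def lookup_fixed(path, target):
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        lo, hi = 0, len(m) // RECORD_LEN
        while lo < hi:
            mid = (lo + hi) // 2                 # the n-th record starts at n * RECORD_LEN
            record = m[mid * RECORD_LEN:(mid + 1) * RECORD_LEN]
            hash_part, _, rest = record.partition(b";")
            if hash_part == target:
                return rest.strip(b" \n\0")      # strip separator, newline and padding
            if hash_part < target:
                lo = mid + 1
            else:
                hi = mid
    return None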
I'd start by splitting the single large file into 65536 smaller files, so that if the hash begins with 0000 it's in the file 00/00data.txt, if the hash begins with 0001 it's in the file 00/01data.txt, etc. If the full file was 12 GiB then each of the smaller files would be (on average) 208 KiB.
Next, separate the hash from the string; such that you've got 65536 "hash files" and 65536 "string files". Each hash file would contain the remainder of the hash (the last 12 digits only, because the first 4 digits aren't needed anymore) and the offset of the string in the corresponding string file. This would mean that (instead of 65536 files at an average of 208 KiB each) you'd have 65536 hash files at maybe 120 KiB each and 65536 string files at maybe 100 KiB each.
Next, the hash files should be in a binary format. 12 hexadecimal digits cost 48 bits (not 12*8 = 96 bits). This alone would halve the size of the hash files. If the strings are aligned on a 4-byte boundary in the strings file, then a 16-bit "offset of the string / 4" would be fine (as long as each string file is less than 256 KiB). Entries in a hash file should be sorted in order, and the corresponding strings file should be in the same order.
After all these changes, you'd use the highest 16 bits of the hash to find the right hash file, load the hash file and do a binary search. Then (if found) you'd get the offset of the start of the string (in the strings file) from the entry in the hash file, plus the offset of the next string from the next entry in the hash file. Then you'd load data from the strings file, starting at the start of the correct string and ending at the start of the next string.
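Purely as an illustration of the packing described above (the function names and the ".bin" file naming are my own placeholders):

import struct

def bucket_path(hex_hash):
    # the first 4 hex digits select one of the 65536 hash files
    return hex_hash[0:2] + "/" + hex_hash[2:4] + "data.bin"

def pack_entry(hex_hash, string_offset):
    # remaining 12 hex digits = 48 bits, plus a 16-bit "string offset / 4"
    remainder = int(hex_hash[4:16], 16)
    return remainder.to_bytes(6, "big") + struct.pack(">H", string_offset // 4)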
Finally, you'd implement a "hash file cache" in memory. If your application can allocate 1.5 GiB of RAM, then that'd be enough to cache half of the hash files. In this case (half the hash files cached) you'd expect that half the time the only thing you'd need to load from disk is the string itself (e.g. probably less than 20 bytes) and the other half the time you'd need to load the hash file into the cache first (e.g. 60 KiB); so on average for each lookup you'd be loading about 30 KiB from disk. Of course more memory is better (and less is worse); and if you can allocate more than about 3 GiB of RAM you can cache all of the hash files and start thinking about caching some of the strings.
A faster way would be to have a reversible encoding, so that you can convert a string into an integer and then convert the integer back into the original string without doing any sort of lookup at all. For an example; if all your strings use lower case ASCII letters and are a max. of 13 characters long, then they could all be converted into a 64-bit integer and back (as 26^13 < 2^63). This could lead to a different approach - e.g. use a reversible encoding (with bit 64 of the integer/hash clear) where possible; and only use some sort of lookup (with bit 64 of the integer/hash set) for strings that can't be encoded in a reversible way. With a little knowledge (e.g. carefully selecting the best reversible encoding for your strings) this could slash the size of your 13 GiB file down to "small enough to fit in RAM easily" and be many orders of magnitude faster.
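For example, a rough sketch of such a reversible encoding for lower-case ASCII strings of at most 13 characters (26^13 < 2^63):

def encode(s):
    n = 0
    for c in s:
        n = n * 26 + (ord(c) - ord("a")) + 1    # bijective base 26, so "a" and "aa" differ
    return n

def decode(n):
    chars = []
    while n:
        n, d = divmod(n - 1, 26)
        chars.append(chr(ord("a") + d))
    return "".join(reversed(chars))

assert decode(encode("somestring")) == "somestring"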

Creating "holes" in a binary H.264 bitstream

I am trying to simulate data loss in a video by selectively removing H.264 bitstream data. The data is simply a raw H.264 file, which is essentially a binary file. My plan is to delete 2 bytes for every 100 bytes so as to achieve a 2% loss. Eventually, I will be testing the effectiveness of some motion vector error concealment algorithms.
It would be nice to be able to do this in a Unix environment. So far, I have investigated the command xxd for a bit and I am able to save a specific portion of a hex dump from a binary file. For example, to skip the first 50 bytes of a binary bitstream and save the subsequent 100 bytes, I would do the following:
xxd -s 50 -l 100 inputBinaryFile | xxd -r > outputBinaryFile
I'm hoping to incorporate something similar into a bash script that will automatically delete the last 2 bytes per 100 bytes. Furthermore, I would like the script to skip everything before the second occurrence of the sequence 00 00 01 06 05 (first P-frame SEI start code).
I don't know how much easier this could be in a C-based language but my programming skills are quite limited and I would rather deal with just Linux programming for now if possible.
Thanks.
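For what it's worth, here is a rough sketch of that plan in Python (file names are placeholders, and it assumes the marker sequence really does occur at least twice):

data = open("inputBinaryFile", "rb").read()

marker = bytes([0x00, 0x00, 0x01, 0x06, 0x05])
first = data.find(marker)
second = data.find(marker, first + 1)       # keep everything before this point intact

head, tail = data[:second], data[second:]
damaged = b"".join(tail[i:i + 98] for i in range(0, len(tail), 100))  # drop the last 2 of every 100 bytes

with open("outputBinaryFile", "wb") as f:
    f.write(head + damaged)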

How to detect that this is a valid binary STL (stereolithography) file

I have an HTML form in which users upload either binary or ASCII STL files. However, I want to make sure only valid binary and ASCII STL files are uploaded, so that if a user changes the extension of, say, a PDF file to .stl (the extension for binary and ASCII 3D STL files), the code detects that it is an invalid STL file.
Quoting Wikipedia:
An ASCII STL file begins with the line:
`solid name`
where name is an optional string (though if name is omitted there must still be a space after solid).
So, to confirm an ASCII STL, check for '^solid (name)?$' on the first line.
To determine if the file is binary STL, take advantage of the length field at offset 80. It specifies the number of triangles in the file.
So, to confirm a Binary STL file, check for this expression:
filesize == UINT32#80 * 50 + 84
The number 84 here is the total size of the binary STL header (80 bytes) plus the 4-byte triangle count that follows it.
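A rough sketch of both checks in Python (function and file names are placeholders; the ASCII test only inspects the first line, as described above):

import os, re, struct

def looks_like_stl(path):
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        header = f.read(84)
        f.seek(0)
        first_line = f.readline().rstrip(b"\r\n")
    # Binary STL: uint32 at offset 80 is the triangle count, 50 bytes per triangle.
    if len(header) == 84:
        (tri_count,) = struct.unpack_from("<I", header, 80)
        if size == 84 + tri_count * 50:
            return True
    # ASCII STL: the first line is "solid", optionally followed by a name.
    return re.match(rb"solid( .*)?$", first_line) is not None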