What is the max number of files to select in an HTML5 [multiple] file input?

I have 64000 small images I want to upload to my website (using existing validation, so no FTP etc). A while back I created an HTML5 [multiple] type=file input for this, intended for a hundred or a few hundred images. Hundreds are not a problem: the images are batched and sent to the server.
But when I select a folder of ~16000 images, the file input's FileList is empty... The onchange event fires, but the file list is empty. The browser (or the file system, or the OS?) seems to have a problem selecting this many files.
I've created a very small tool to help determine what could be the max: http://jsfiddle.net/rudiedirkx/Ehhk5/1/show/
$inp.onchange = function(e) {
    var chars = 0, bytes = 0;
    // Tally the combined file name length and total size of the selection.
    for (var i = 0, F = this.files, L = F.length; i < L; i++) {
        chars += F[i].name.length;
        bytes += F[i].size;
    }
    $nf.innerHTML += this.files.length + ' files: ' + (bytes / 1000 / 1000) + ' MB / ' + chars + ' chars of filename<br>';
};
All it does is count:
the number of files
the combined number of characters of all file names
the total file size in MB
When I try this, the very most I get is:
1272 files: 176.053987 MB / 31469 chars of filename
(On 32 & 64 bit Win7, Chrome 26-52)
The next image (which fails) would be:
1273 images, which is not an obvious cut-off
between 176 and 177 MB filesize, also not an obvious cut-off
less than 32000 chars of filenames, also not an obvious cut-off, although it sort-of maybe looks like 32k...
In my calc, 1 MB = 1000^2 Bytes, not 1024^2. (That would be a MiB, but maybe my OS/filesystem/browser disagrees.)
My question is: why this many files? Why this max? Is it OS dependent or browser dependent? Where do I find the specs for that? Is it JS's fault? Searching for "file input max files" et al. only turns up the [max] attribute, which is irrelevant.
More test results:
In Firefox the max seems to be much higher. At least "2343 files: 310.66553999999996 MB / 60748 chars of filename" (that's all the files I have right here)
In Firefox also: "16686 files: 55.144415 MB / 146224 chars of filename" (much smaller, but more files)
Update
Chrome 52 canary on Windows is still limited to ~32k chars of file names
Firefox (44+) on Windows is still unlimited

why this many files?
The number of files depends on the combined length of all the file names.
Why this max?
In the Windows API, the ordinary maximum path length limitation (MAX_PATH) is 260 chars; the Unicode versions of the APIs allow up to 32,767 chars.
Chrome simply sizes its selection buffer to the Unicode API's maximum path length, so the limit is about 32k chars, as you observed.
Check this fix: https://code.google.com/p/chromium/issues/detail?id=44068
Firefox dynamically allocates a buffer big enough to hold the names of all the selected files, so it can handle a much larger combined path length.
Check this fix: https://bugzilla.mozilla.org/show_bug.cgi?id=660833
Is it OS dependent or browser dependent?
Both.
Where do I find the specs for that?
For Windows API usage and reference:
http://msdn.microsoft.com/en-us/library/aa365247.aspx (Maximum Path Length Limitation)
http://msdn.microsoft.com/en-us/library/ms646839(VS.85).aspx
Is it JS' fault?
No.

Related

What File Format Has This Magic Header?

I've got a bunch of files that from metadata I can tell are supposed to be PDFs. Some of them are indeed complete PDFs. Some of them appear to be the first part of a PDF file, though they lack the %%EOF and other footer values.
Others appear to be the last part of PDF files (they don't have any of a PDF's headers but they do have the %%EOF stuff). Curiously they start with the following 16-byte magic header:
0x50, 0x4B, 0x57, 0x41, 0x52, 0x45, 0x00, 0x00, 0x00, 0x00, 0x00, 0x57, 0x49, 0x4E, 0x33, 0x32 (PKWARE WIN32).
I'm doing a lot of inference here, which could be misleading, but it doesn't seem to be a compression scheme (the %%EOF stuff is plaintext), and in the few files I've been allowed to look at deeply, there's a correlation between starting with this magic and looking like the final segment of a PDF binary.
Does anyone have any hints as to what file format might be at play here?
Update: I've now observed this PKWARE WIN32 happening on non-PDF files as well. Speculation also suggests that these files are split up in a similar manner.
Update 2: It turns out this PKWARE WIN32 header actually occurs in repeating intervals, the location of which can be predicted by some bytes immediately prior to the header.
I've also received some circumstantial hearsay which suggests that these files are compressed and not split into multiple parts, though in 2 out of the 3 cases where I was told the output file sizes my binaries were only negligibly smaller.
The mystery continues.
Okay, so this ended up being a very strange format. Overall it's a compression scheme, but it's applied inconsistently and lightly wrapped in a way that confounded the issue.
The first 8 bytes of any of these files will start with its own magic, and the next 8 bytes can be read as a long to tell us the final size of the output file.
Then there's a 16-byte "section" header (four ints): the first is just an incremental counter, the second is the number of bytes until the next "section" break, the third is a bit of a mystery to me, and the fourth is either 0 or 1. If that last int is 0, just read the next (however many) bytes as-is; they're payload.
If it's 1, you'll get one of these PKWARE headers next. These are honestly the part I understand least well, other than that they start with the magic in the original question and are 42 bytes long in total.
If you had a PKWARE header, subtract 42 from the number of bytes to read, then treat the remaining bytes as compressed with PKWARE's "implode" algorithm. That means you can decompress them with an "explode" implementation (such as the one distributed with zlib's source).
Iterate through the file, taking all these headers into account and cobbling together the compressed and uncompressed parts, until you run out of bytes, and you'll end up with your output file.
I have no idea why only parts of the files are compressed, nor why they've been broken into blocks like this, but it seems to work for the limited sample data I have. Perhaps later on I'll find files that actually have been split along those boundaries, or that employ some kind of fancy deduplication, but at least now I can explain why it looked like I was seeing partial PDFs - the files were only partially compressed.
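Putting that together, here is a sketch of the unpacking loop in Python. It is my own illustration of the description above: the little-endian byte order, the signedness of the fields, and the assumption that the section's byte count excludes the 16-byte header are all guesses, and pkware_explode is a hypothetical stand-in for a PKWARE DCL "explode" decoder (e.g. a binding of zlib's contrib/blast):

import struct

def unpack_file(path, pkware_explode):
    # pkware_explode: hypothetical callable that decompresses PKWARE
    # DCL "implode"-compressed bytes.
    out = bytearray()
    with open(path, "rb") as f:
        f.read(8)                                       # the container's own 8-byte magic
        (final_size,) = struct.unpack("<q", f.read(8))  # declared size of the output file
        while len(out) < final_size:
            header = f.read(16)                         # four ints per "section"
            if len(header) < 16:
                break                                   # ran out of bytes
            counter, span, mystery, compressed = struct.unpack("<4i", header)
            payload = f.read(span)                      # bytes until the next section break
            if compressed:
                # a 42-byte PKWARE WIN32 header precedes the compressed data
                out += pkware_explode(payload[42:])
            else:
                out += payload                          # raw bytes, copied as-is
    return bytes(out[:final_size])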

Maximum size of xattr in OS X

I would like to use xattr to store some metadata on my files, directly on the files. These are essentially tags that I use to categorize files when searching them. My goal is to extend the usual Mac OS X tags by associating more info with each tag, for instance the date the tag was added, and maybe other things.
I was thinking of adding an xattr to the files using xattr -w. My first guess would be to store something like JSON in the xattr value, but I was wondering:
1) what are the size limits for what I can store in an xattr? (the xattr man page is vague and refers to something called _PC_XATTR_SIZE_BITS, which I cannot locate anywhere)
2) is there anything wrong with storing a JSON-formatted string as an xattr?
According to man pathconf, there is a “configurable system limit or option variable” called _PC_XATTR_SIZE_BITS which is
the number of bits used to store maximum extended attribute size in bytes. For example, if the maximum attribute size supported by a file system is 128K, the value returned will be 18. However a value 18 can mean that the maximum attribute size can be anywhere from (256KB - 1) to 128KB. As a special case, the resource fork can have much larger size, and some file system specific extended attributes can have smaller and preset size; for example, Finder Info is always 32 bytes.
You can determine the value of this parameter using this small command line tool written in Swift 4:
import Foundation

// Report _PC_XATTR_SIZE_BITS for the path given as the first argument.
let args = CommandLine.arguments.dropFirst()
guard let pathArg = args.first else {
    print("File path argument missing!")
    exit(EXIT_FAILURE)
}

// pathconf(2) queries the limit for the file system the path lives on.
let v = pathconf(pathArg, _PC_XATTR_SIZE_BITS)
print("_PC_XATTR_SIZE_BITS: \(v)")
exit(EXIT_SUCCESS)
I get:
31 bits for HFS+ on OS X 10.11
64 bits for APFS on macOS 10.13
as the number of bits used to store maximum extended attribute size. These imply that the actual maximum xattr sizes are somewhere in the ranges
1 GiB ≤ maximum < 2 GiB for HFS+ on OS X 10.11
8 EiB ≤ maximum < 16 EiB for APFS on macOS 10.13
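The arithmetic behind those ranges is just powers of two; a quick sketch (mine, not part of the original answer):

# A pathconf value of v bits means the maximum attribute size needs
# v bits to represent, i.e. 2**(v-1) <= max size <= 2**v - 1.
for bits, fs in ((31, "HFS+ on OS X 10.11"), (64, "APFS on macOS 10.13")):
    low, high = 2 ** (bits - 1), 2 ** bits - 1
    print(f"{fs}: {low:,} to {high:,} bytes")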
I seem to be able to write at least 260kB, like this, by generating 260kB of nulls and converting them to the letter "a" so I can see them:
xattr -w myattr "$(dd if=/dev/zero bs=260000 count=1|tr '\0' a)" fred
1+0 records in
1+0 records out
260000 bytes transferred in 0.010303 secs (25235318 bytes/sec)
And then read them back with:
xattr -l fred
myattr: aaaaaaaaaaaaaaaaaa...aaa
And check the length returned:
xattr -l fred | wc -c
260009
I suspect this is actually a limit of ARG_MAX on the command line:
sysctl kern.argmax
kern.argmax: 262144
Also, just because you can store 260kB in an xattr, that does not mean it is advisable. I don't know about HFS+, but on some Unixy filesystems, the attributes can be stored directly in the inode, but if you go over a certain limit, additional space has to be allocated on disk for the data.
With the advent of High Sierra and APFS to replace HFS+, be sure to test on both filesystems. Also make sure that Time Machine backs up and restores the data, and that utilities such as ditto, tar and the Finder propagate the attributes when copying/moving/archiving files.
Also consider what happens when you email a tagged file, or copy it to a FAT-formatted USB memory stick.
I also tried setting multiple attributes on a single file, and the following script successfully wrote 1,000 attributes (called attr-0, attr-1 ... attr-999), each of 260kB, to a single file - meaning that the file effectively carries 260MB of attributes:
#!/bin/bash
# Write 1,000 attributes of 260kB each to the file "fred".
for ((a=0; a<1000; a++)); do
    echo "Setting attr-$a"
    # 260kB of nulls, turned into "a"s so the value is visible
    xattr -w attr-$a "$(dd if=/dev/zero bs=260000 count=1 2> /dev/null | tr '\0' a)" fred
    if [ $? -ne 0 ]; then
        echo "ERROR: Failed to set attr"
        exit 1
    fi
done
These can all be seen and read back too - I checked.

Really 1 KB (KiloByte) equals 1024 bytes?

Until now I believed that 1024 bytes equals 1 KB (kilobyte), but I was reading on the internet about the decimal and binary systems.
So, is 1024 bytes = 1 KB actually the correct definition, or is there simply general confusion?
What you are seeing is a marketing stunt.
Since non-technical people don't know the difference between the metric meg, gig, etc. and the binary meg, gig, etc., storage marketers use the metric calculation, thus 1000 bytes == 1 kilobyte.
This can cause issues for developers and highly technical people, so you get the idea of a binary meg, gig, etc., which is designated with a "bi" instead of the standard prefix (e.g. mebibyte vs megabyte, or gibibyte vs gigabyte).
There are two ways to represent big numbers: you can display them in multiples of 1000 (base 10) or of 1024 (base 2). If you divide by 1000, you probably use the SI prefix names; if you divide by 1024, you probably use the IEC prefix names. The problem starts with dividing by 1024: many applications use the SI prefix names for it, and some use the IEC prefix names. But it is important how it is written:
Using IEC standard:
1 KiB = 1,024 bytes (Note: big K)
1 MiB = 1,024 KiB = 1,048,576 bytes
Using SI standard:
1 kB = 1,000 bytes (Note: small k)
1 MB = 1,000 kB = 1,000,000 bytes
Source: Ubuntu units policy: https://wiki.ubuntu.com/UnitsPolicy
In the normal world, most things go by powers of 10. That includes electricity, for example.
But the computer world is about half binary. For example, when a hard drive is sold, it is sold by powers of 10, so a 1KB drive is 1000 B. But when the computer reads it, the OS usually reads by 1024. This is why, when you read the size of available space on a drive, it reads much less than what was advertised: a 500 GB drive will read as only about 466GB, because the computer is reading the drive by the binary 1024 version, not the power of 10 it was sold and advertised by. The same goes for flash drives. RAM, however, is both sold and read by the binary 1024 version.
One thing to note: it is "B", not "b". There are 8 bits ("b") in a byte ("B"). The reason I bring this up is that when you get internet service, the speed is usually advertised in bits, not bytes, while the download box on the computer reads the speed in bytes. Say you have a 50Mb internet connection; that is actually a 6.25MB connection in the download speed box, because you divide the 50 by 8 since there are 8 bits in a byte. Another marketing strategy: after all, 50Mb sounds much faster than 6.25MB. Other than speeds through a network, most things are read in bytes ("B"). Some people do not realize there is a difference between "B" and "b".
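A quick sketch (my own, not from the answer) reproducing both of those conversions in Python:

# A "500 GB" drive as reported by an OS that divides by 1024:
print(500 * 1000**3 / 1024**3)   # ~465.66, the "466GB" mentioned above

# A 50 megabit/s connection expressed in megabytes/s:
print(50 / 8)                    # 6.25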
Quite simple...
The word "byte" is a computing reference, for which the letter "B" is used as an abbreviation.
It must follow, then, that any reference to bytes, e.g. KB, MB etc., must be based on the well-known and widely accepted 1024 base.
Therefore 1KB must equal 1024 bytes, 1MB must equal 1048576 bytes (1024x1024), etc.
Any non-computing reference to kilo/mega etc. is based on the decimal 1000 base, e.g. 1 kW (kilowatt), which is 1000 watts.

Searching through very large rainbow table file

I am looking for the best way to search through a very large rainbow table file (13GB file). It is a CSV-style file, looking something like this:
1f129c42de5e4f043cbd88ff6360486f; somestring
78f640ec8bf82c0f9264c277eb714bcf; anotherstring
4ed312643e945ec4a5a1a18a7ccd6a70; yetanotherstring
... you get the idea - there are about ~900 Million lines, always with a hash, semicolon, clear text string.
So basically, the program should check whether a specific hash is listed in this file.
What's the fastest way to do this?
Obviously, I can't read the entire file into memory and then run strstr() on it.
So what's the most efficient way to do this?
read the file line by line, running strstr() on each line;
read larger chunks of the file (e.g. 10,000 lines) and run strstr() on each chunk
Or would it be more efficient to import all this data into a MySQL database and then search for the hash via SQL queries?
Any help is appreciated.
The best way to do it would be to sort the file and then use a binary-search-like algorithm on it. After sorting, it will take around O(log n) time to find a particular entry, where n is the number of entries you have. Your algorithm might look like this:
1. Keep a start offset and an end offset. Initialize the start offset to zero and the end offset to the file size.
2. If start = end, there is no match.
3. Read some data from the offset (start + end) / 2.
4. Skip forward until you see a newline. (You may need to read more, but if you pick an appropriate size (bigger than most of your records) to read in step 3, you probably won't have to read any more.)
5. If the hash you're on is the hash you're looking for, go on to step 8.
6. Otherwise, if the hash you're on is less than the hash you're looking for, set start to the current position and go to step 2.
7. If the hash you're on is greater than the hash you're looking for, set end to the current position and go to step 2.
8. Skip to the semicolon and trailing space. The unhashed data will be from the current position to the next newline.
This can be easily converted into a while loop with breaks.
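For concreteness, here is a minimal sketch of those steps in Python. It is my own illustration, assuming the file has been sorted by hash and that every line has the form hash + "; " + plaintext + newline:

import os

def lookup(path, target_hash):
    # Binary search a sorted "hash; plaintext" file for target_hash.
    target = target_hash.encode()
    with open(path, "rb") as f:

        def line_start_at_or_after(pos):
            # Offset of the first complete line beginning at or after pos.
            if pos == 0:
                return 0
            f.seek(pos - 1)
            f.readline()    # consume through the newline preceding that line
            return f.tell()

        start, end = 0, os.path.getsize(path)          # step 1
        while start < end:                             # step 2: empty window = no match
            mid = (start + end) // 2
            line_start = line_start_at_or_after(mid)   # steps 3-4
            f.seek(line_start)
            line = f.readline()
            if not line:                               # mid fell beyond the last line
                end = mid
                continue
            h, _, plain = line.partition(b"; ")
            if h == target:                            # steps 5 and 8: found it
                return plain.rstrip(b"\n").decode()
            if h < target:                             # step 6: search the upper half
                start = line_start + len(line)
            else:                                      # step 7: search the lower half
                end = mid
    return None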
Importing it into MySQL with appropriate indices and such would use a similarly efficient algorithm (or a better one, since the data is probably packed nicely).
Your last solution might be the easiest one to implement, as you move all the performance optimization to the database (and databases are usually optimized for exactly that).
strstr is not useful here, as it searches for a string anywhere, but you know the specific format and can jump and compare in a more goal-oriented way. Think about strncmp and strchr.
The overhead of reading a single line at a time would be really high (as is often the case for file I/O). So I'd recommend reading a larger chunk and performing your search on that chunk. I'd even think about parallelizing the search by reading the next chunk in another thread and doing the comparison there as well.
You can also think about using memory-mapped I/O instead of the standard C file API. With this you can leave the loading of the contents to the operating system and don't have to care about caching yourself.
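In Python, for instance, memory-mapped I/O is only a few lines (a sketch of the idea, with a made-up file name, reusing a hash from the question):

import mmap

with open("rainbow.txt", "rb") as f:
    # Map the whole file; the OS pages it in and out on demand.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        pos = mm.find(b"4ed312643e945ec4a5a1a18a7ccd6a70")
        if pos != -1:
            end = mm.find(b"\n", pos)
            print(mm[pos:end].decode())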
Of course, restructuring the data for faster access would help you too. For example, insert padding bytes so all records are equally long. This gives you "random" access to your data stream, as you can easily calculate the position of the nth entry.
I'd start by splitting the single large file into 65536 smaller files, so that if the hash begins with 0000 it's in the file 00/00data.txt, if the hash begins with 0001 it's in the file 00/01data.txt, etc. If the full file was 12 GiB then each of the smaller files would be (on average) 208 KiB.
Next, separate the hash from the string; such that you've got 65536 "hash files" and 65536 "string files". Each hash file would contain the remainder of the hash (the last 12 digits only, because the first 4 digits aren't needed anymore) and the offset of the string in the corresponding string file. This would mean that (instead of 65536 files at an average of 208 KiB each) you'd have 65536 hash files at maybe 120 KiB each and 65536 string files at maybe 100 KiB each.
Next, the hash files should be in a binary format. 12 hexadecimal digits costs 48 bits (not 12*8=96-bits). This alone would halve the size of the hash files. If the strings are aligned on a 4 byte boundary in the strings file then a 16-bit "offset of the string / 4" would be fine (as long as the string file is less than 256 KiB). Entries in the hash file should be sorted in order, and the corresponding strings file should be in the same order.
After all these changes, you'd use the highest 16 bits of the hash to find the right hash file, load the hash file and do a binary search. Then (if found) you'd get the offset for the start of the string (in the strings file) from the entry in the hash file, plus the offset for the next string from the next entry in the hash file. Then you'd load data from the strings file, starting at the start of the correct string and ending at the start of the next string.
Finally, you'd implement a "hash file cache" in memory. If your application can allocate 1.5 GiB of RAM, then that'd be enough to cache half of the hash files. In this case (half the hash files cached) you'd expect that half the time the only thing you'd need to load from disk is the string itself (e.g. probably less than 20 bytes) and the other half the time you'd need to load the hash file into the cache first (e.g. 60 KiB); so on average for each lookup you'd be loading about 30 KiB from disk. Of course more memory is better (and less is worse); and if you can allocate more than about 3 GiB of RAM you can cache all of the hash files and start thinking about caching some of the strings.
A faster way would be to have a reversible encoding, so that you can convert a string into an integer and then convert the integer back into the original string without doing any sort of lookup at all. For example, if all your strings use lowercase ASCII letters and are a maximum of 13 characters long, then they could all be converted into a 64-bit integer and back (as 26^13 < 2^63). This could lead to a different approach: e.g. use a reversible encoding (with bit 64 of the integer/hash clear) where possible, and only use some sort of lookup (with bit 64 of the integer/hash set) for strings that can't be encoded reversibly. With a little knowledge (e.g. carefully selecting the best reversible encoding for your strings) this could slash the size of your 13 GiB file down to "small enough to fit in RAM easily" and be many orders of magnitude faster.
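As an illustration of that last idea, a base-27 packing in Python (my sketch, not from the answer: base 27 rather than 26 lets the decoder recover the string length, and 27^13 is still below 2^63, so bit 64 stays free as the "not encodable" flag):

def encode(s):
    # Pack 1-13 lowercase ASCII letters into an integer, reversibly.
    assert 1 <= len(s) <= 13 and all("a" <= c <= "z" for c in s)
    n = 0
    for c in s:
        n = n * 27 + (ord(c) - ord("a") + 1)   # digit 0 is reserved for "no letter"
    return n

def decode(n):
    # Invert encode().
    letters = []
    while n:
        n, d = divmod(n, 27)
        letters.append(chr(ord("a") + d - 1))
    return "".join(reversed(letters))

assert decode(encode("yetanotherstr")) == "yetanotherstr"   # 13 chars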

Is there a fast way of adding or removing content in the middle of a very large file

Say I have a very large file (say > 1GB) and I want to add a single character in the middle of it. Is it possible to do this without reading and writing the whole file out? My current solution is this (in pseudocode):
x = 0
chunk = read 4KB chunk x of input file
if chunkToEdit = x, chunk = addCharacter(chunk)
append chunk to the output file
x = x + 1
repeat last 4 steps until input file is fully read
delete input file
move output file to input file
While that works, it results in 1GB of reading and 1GB of writing to make a single-character change. It also requires a spare 1GB of disk space. What I would rather do is modify the part of the file that needs to change in place, so I only have to read and write one small part of the file (i.e. 4KB of reading and 4KB of writing). Is this possible (or is there a solution better than mine)?
I thought a solution for this could be possible if the OS fragmented the file and made a new fragment for the changed section, but I don't know whether that capability has been written and exposed to developers.
No. Files don't work like that. If you need to change the size of the file then you need to operate from the modification point to the end.
Unless you're using a file format that can handle insertions/deletions cleanly, but it sounds like you aren't.
Adding a single character in the middle necessarily requires shifting everything after that character by one character. This necessarily requires that you read and write everything from the point of insertion to the end of the file. A way that uses as little memory as possible to do so would be:
i = 1
repeat until the point of insertion is reached:
    read the i-th chunk of n bytes, counting backwards from the end of the file
    write the chunk back, shifted forward by 1 character
    i++
write the single character at the insertion point
In other words: shift everything in chunks of n bytes by one character starting from the end going backwards through the file to the point of insertion, then insert the character. The farther back in the file you want to insert the character, the faster this will be. If you often want to insert near the beginning of the file, this may not be the best solution.
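A minimal sketch of that scheme in Python (my illustration; the 4KB chunk size is arbitrary, and note that an interruption midway leaves the file corrupted, since it is rewritten in place):

import os

CHUNK = 4096

def insert_byte(path, offset, byte):
    # Shift everything from offset onwards forward by one byte, working
    # in CHUNK-sized pieces from the end of the file backwards, then
    # write the new byte into the gap.
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        f.truncate(size + 1)                 # grow the file by one byte
        pos = size                           # exclusive end of the unshifted tail
        while pos > offset:
            n = min(CHUNK, pos - offset)     # the last not-yet-shifted chunk
            f.seek(pos - n)
            chunk = f.read(n)
            f.seek(pos - n + 1)              # write it back one byte later
            f.write(chunk)
            pos -= n
        f.seek(offset)
        f.write(byte)                        # e.g. insert_byte("big.bin", 5, b"X")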