I am trying to parse the GIF format and have a problem with reading the image data.
The data is represented as a bit array containing variable-length codes, e.g.:
0010-1010-0010-0000-00111-10000-11111...
Sometimes the code length increases, but I can't figure out how to detect when that happens.
All I have is the initial code size (the length of the first code, e.g. 4).
The standard says only:
Appendix F. Variable-Length-Code LZW Compression.
...
The Variable-Length-Code aspect of the algorithm is based on an initial code size
(LZW-initial code size), which specifies the initial number of bits used for
the compression codes. When the number of patterns detected by the compressor
in the input stream exceeds the number of patterns encodable with the current
number of bits, the number of bits per LZW code is increased by one.
...
When parsing a GIF file, the Image Descriptor includes the bit width of the unencoded symbols (example: 8 bits).
As you probably already know, the initial code size of the compressed data is one bit wider than the bit width of the unencoded symbols (example: 9 bits).
Also, as you probably already know, the possible compressed code values in a GIF file gradually increase in size,
up to a maximum of 0xFFF == 4095 which requires 12 bits to store.
For every code that the decompressor pulls from the compressed data,
the decompressor adds a new item to its dictionary.
For example, if the first two 9-bit codes the decompressor reads are 0x028 0x0FF,
the decompressor adds a two-byte sequence to its dictionary.
Later if the decompressor ever reads the code 0x102,
the decompressor decodes that 0x102 code to the two 8-bit bytes 0x28 0xFF.
The next item the decompressor adds to the dictionary is assigned the code 0x103.
The next item the decompressor adds to the dictionary is assigned the code 0x104. ...
Eventually the decompressor adds an item to the dictionary that is assigned the code 0x1FF.
That is the largest number that fits into 9 bits.
After storing that item into the dictionary,
the decompressor starts reading 10-bit codes.
The next item the decompressor adds to the dictionary is assigned the code 0x200.
There isn't any special "command" in the data sequence that tells the decompressor to increment the code width.
The decompressor must keep track of how many items the dictionary contains so far (which often can be conveniently re-used as the index of where to write the next item into the dictionary).
When the decompressor adds item 0x1ff to the dictionary, it's time for the decompressor to start reading 10-bit codes.
When the decompressor adds item 0x3ff to the dictionary, it's time for the decompressor to start reading 11-bit codes.
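To make the width bookkeeping concrete, here is a minimal Python sketch. It is not a full decoder (it doesn't build the dictionary or emit pixels), the BitReader/walk_codes names are mine, and min_code_size is the LZW minimum code size stored in the GIF data (the bit width of the unencoded symbols), so the codes themselves start one bit wider. GIF packs codes starting from the least significant bit of each byte, and the code width is capped at 12 bits.

class BitReader:
    """Reads variable-width codes from GIF image data (codes are packed LSB-first)."""
    def __init__(self, data):
        self.data = data
        self.pos = 0                         # current position, in bits

    def read(self, nbits):
        value = 0
        for i in range(nbits):
            byte_index, bit_index = divmod(self.pos + i, 8)
            if (self.data[byte_index] >> bit_index) & 1:
                value |= 1 << i
        self.pos += nbits
        return value

def walk_codes(data, min_code_size=8):
    """Yield (code, width); widen the width as the dictionary fills up."""
    reader = BitReader(data)
    clear_code = 1 << min_code_size          # e.g. 0x100 for 8-bit symbols
    end_code = clear_code + 1                # e.g. 0x101
    next_code = end_code + 1                 # the first table entry will get 0x102
    code_size = min_code_size + 1            # start by reading 9-bit codes
    just_cleared = True                      # the code right after a clear adds no entry

    while True:
        code = reader.read(code_size)
        yield code, code_size
        if code == end_code:
            return
        if code == clear_code:
            next_code = end_code + 1
            code_size = min_code_size + 1
            just_cleared = True
        elif just_cleared:
            just_cleared = False
        else:
            next_code += 1                   # decoding this code added one dictionary entry
            # once entry (2^code_size - 1) exists, codes are read one bit wider
            if next_code == (1 << code_size) and code_size < 12:
                code_size += 1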
Wikipedia: Graphics Interchange Format
Wikipedia: Lempel–Ziv–Welch with variable-length codes
Look at the example in the Lempel–Ziv–Welch article first; it may make LZW clearer than reading the standard.
The GIF article may also be useful.
I've been working on an RFID project to produce our own RFID cards to work on our existing timeclocks and readers.
I've got most of the work done and have been able to successfully write a Hitag2 card using the values of pages 4 & 5 from another card (so basically copying the card), then changing the config bit, which makes it act like an EM4x02 and allows our readers to read it.
What I'm struggling with is relating the hex code on pages 4/5 to the output you get when scanning it as an EM4x.
The values of the hitag page 4/5 are FF800000/003EDF10. This translates to 0000001EBC when read as an EM4x.
Does anybody have an idea on how this translation is done? I've tried using the methods in RFIDIOT but that doesn't seem to work for this.
I've managed to find out how this is done after finding a Hitag2 datasheet from 1999 (the only one I could find that explains the bits when the Hitag is in public mode A).
Firstly, convert the number you want on the EM4 card to hex.
Convert that hex into binary.
Split the binary into 4-bit chunks, then work out the even parity for each chunk and append it to the end of that chunk (so you'll end up with 5 bits per chunk).
Then work out the even parity of each column in the data (i.e. the first bit of every chunk, then the second, etc., ignoring the parity bit you added) and add these 4 bits to the binary string.
Then add the correct number of zeros at the start so that the data section has 50 bits.
Once you have the data section sorted, add 9 bits of 1 to the beginning (header) and a final 0 to the very end of the binary.
Your whole binary string should be 64 bits long.
Convert this to hex and split it in half. You can then write these onto pages 4/5 of a Hitag2 card.
You then need to change the configuration bit to 0x02 for the tag to work in public mode A.
Just thought I would send you the diagram of how this works (image: EM4x tag data).
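Here is the recipe above as a small Python sketch (the function name is mine). Feeding it the value from the question, 0x1EBC (EM4x ID 0000001EBC), reproduces the FF800000/003EDF10 pages:

def em4x_to_hitag2_pages(em_id):
    """Build the 64-bit EM4x bitstream and split it into the two Hitag2 pages."""
    # 10 nibbles of ID, most significant first (this is where the leading zeros go)
    nibbles = [(em_id >> (4 * i)) & 0xF for i in reversed(range(10))]

    bits = "1" * 9                                   # 9-bit header of ones
    for n in nibbles:                                # each row: 4 data bits + even parity
        row = format(n, "04b")
        bits += row + str(row.count("1") % 2)

    for col in range(4):                             # 4 column parity bits (even parity)
        col_sum = sum((n >> (3 - col)) & 1 for n in nibbles)
        bits += str(col_sum % 2)

    bits += "0"                                      # stop bit -> 64 bits in total
    value = int(bits, 2)
    return format(value >> 32, "08X"), format(value & 0xFFFFFFFF, "08X")

print(em4x_to_hitag2_pages(0x1EBC))                  # ('FF800000', '003EDF10')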
I've got a bunch of files that from metadata I can tell are supposed to be PDFs. Some of them are indeed complete PDFs. Some of them appear to be the first part of a PDF file, though they lack the %%EOF and other footer values.
Others appear to be the last part of PDF files (they don't have any of a PDF's headers but they do have the %%EOF stuff). Curiously they start with the following 16-byte magic header:
0x50, 0x4B, 0x57, 0x41, 0x52, 0x45, 0x00, 0x00, 0x00, 0x00, 0x00, 0x57, 0x49, 0x4E, 0x33, 0x32 (PKWARE WIN32).
I'm doing a lot of inference here, which could be misleading, but it doesn't seem to be a compression scheme (the %%EOF stuff is plain text), and in the few files I've been allowed to look at deeply there's a correlation between starting with this magic and looking like the final segment of a PDF binary.
Does anyone have any hints as to what file format might be at play here?
Update: I've now observed this PKWARE WIN32 happening on non-PDF files as well. Speculation also suggests that these files are split up in a similar manner.
Update 2: It turns out this PKWARE WIN32 header actually occurs in repeating intervals, the location of which can be predicted by some bytes immediately prior to the header.
I've also received some circumstantial hearsay suggesting that these files are compressed rather than split into multiple parts, though in 2 of the 3 cases where I was told the output file sizes, my binaries were only negligibly smaller.
The mystery continues.
Okay, so this ended up being a very strange format. Overall it's a compression scheme, but it's applied inconsistently and lightly wrapped in a way that confounded the issue.
Each of these files starts with its own 8-byte magic, and the next 8 bytes can be read as a 64-bit integer giving the final size of the output file.
Then comes a 16-byte "section" header of four ints: the first is just an incremental counter, the second is the number of bytes until the next section break, the third is a bit of a mystery to me, and the fourth is either 0 or 1. If that last int is 0, just read the next (however many) bytes as-is; they're payload.
If it's 1, you'll get one of these PKWARE headers next. Honestly these are the part I understand least, other than that they start with the magic from the original question and are 42 bytes long in total.
If there was a PKWARE header, subtract 42 from the number of bytes to read and treat the remaining bytes as compressed with PKWARE's "implode" algorithm, meaning you can use an "explode" implementation (such as the blast code in zlib's contrib directory) to decompress them.
Iterate through the file, taking all these headers into account and stitching together the compressed and uncompressed parts until you run out of bytes, and you'll end up with your output file.
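To make that concrete, here is roughly how I'd sketch the loop in Python. The little-endian byte order, the assumption that the length field does not count the 16-byte header, and the explode() placeholder (standing in for a PKWARE "implode" decompressor, e.g. the blast code in zlib's contrib directory) are all guesses on my part, not verified facts about the format:

import struct

def explode(data):
    """Placeholder for a PKWARE DCL "explode" decompressor."""
    raise NotImplementedError

def unwrap(blob):
    """Stitch the stored and compressed chunks back into the original file."""
    (final_size,) = struct.unpack_from("<q", blob, 8)   # bytes 8-15: declared output size
    pos = 16                                            # bytes 0-7 are the container's magic
    out = bytearray()
    while pos < len(blob):
        counter, length, mystery, flag = struct.unpack_from("<4i", blob, pos)
        pos += 16
        chunk = blob[pos:pos + length]                  # bytes until the next section break
        pos += length
        if flag == 0:
            out += chunk                                # stored as-is
        else:
            assert chunk[:6] == b"PKWARE"               # 42-byte PKWARE header comes first
            out += explode(chunk[42:])                  # the rest is "imploded" data
    assert len(out) == final_size
    return bytes(out)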
I have no idea why only parts of the files are compressed, nor why they've been broken into blocks like this, but it seems to work for the limited sample data I have. Perhaps later on I'll find files that actually have been split along those boundaries, or that employ some kind of fancy deduplication, but at least now I can explain why it looked like I was seeing partial PDFs -- the files were only partially compressed.
The question
What are the ways of coercing Octave into creating a real copy of an arbitrary object? Structures are my main interest.
My underlying problem
In my problem I'm obtaining a rather large structure from another function in a loop but for the current task only a few pieces of it are needed. For example:
for i=1:many
res=solver(params);
store1{i}=res.string1;
store2{i}=res.arr(:,1);
end
res is a sizable chunk of data, and due to lazy copying those store-s are references to tiny portions of bytes inside that chunk. After I store those tiny portions I don't need res itself any more; however, since the middle of that chunk is still referenced by the stores, the memory area cannot be reused for the res obtained on the next iteration (they are the same size), so another sizable piece of memory is allocated, which is then again pinned by a few tiny references, and so on.
Without storing parts of res, the program successfully keeps the memory consumption the same after the first couple of iterations.
So how do I make a complete copy of structure field?
I've tried using struct-related functions like rmfield, but those keep references instead of making their own copies.
I've tried wrapping the assignment in its own function:
new_struct = copy(rmfield(old_struct, "bigdata"));

function c = copy(a)
  c = a;
end
This by the way doesn't work even for arrays.
I'm interested in a method applicable to any generic variable.
Minimal working example of the problem
a = cell(3,1);
for i = 1:length(a)
  r = rand(100000, 1000);
  a{i} = r(1:100, end);   % keeps a lazy reference into the big matrix r
  whos; fflush(stdout);
  pause(2);
end
The above code will cause memory usage to gradually grow by far more than the 8.08 kB reported by whos, because the references stored in a{i} keep a much bigger memory block alive than they actually need. If you force a proper copy, the problem is not present.
Numerical arrays
For numeric types, adding zero is enough to force a new array.
c=a+0;
Strings
For a string, which is a 1×n char array, something along the following lines will work:
c=[a "a"](1:end-1);
Multidimensional char arrays will require concatenation with a column:
c=[a true(size(a,1),1)](:,1:end-1);
Here true is used to generate a dummy array of a size compatible with char (there seems to be no procedural way of generating a char array of arbitrary size); char(zeros(size(a,1),1)) and char(true(size(a,1),1)) caused excess memory usage during their creation on some calls.
Note that the empty concatenation c=[a ""]; will not result in a copy. It is also possible to do c=[a+0 ""];, which does result in a copy thanks to the +0, but that incurs type conversion to and from double, which is 8 times larger in size (char(zeros(...)) doesn't seem to cause that).
Other types
In general you can use casting for the types that support it, so you don't have to tailor the expressions manually as I did above:
typelist={"double","single","char"}; %full list of supported types is available in the link
class_of_a = typelist{ isa(a,typelist) };
c=typecast( [typecast(a,'single'); single(1)] (1:end-1), class_of_a);
single is seemingly the smallest datatype available in Octave.
Note that logical is not supported by this method.
Copying structures
Apparently you'd have to write your own function that walks the struct fields, copies them with the above methods, and recurses into sub-structs.
(As it doesn't involve any complexities relevant here, I'd rather leave that to those who actually need it; my own problem was solved by the +0 trick.)
I am looking for the best way to search through a very large rainbow table file (13GB file). It is a CSV-style file, looking something like this:
1f129c42de5e4f043cbd88ff6360486f; somestring
78f640ec8bf82c0f9264c277eb714bcf; anotherstring
4ed312643e945ec4a5a1a18a7ccd6a70; yetanotherstring
... you get the idea - there are about 900 million lines, each with a hash, a semicolon and a clear-text string.
So basically, the program should check whether a specific hash is listed in this file.
What's the fastest way to do this?
Obviously, I can't read the entire file into memory and then run strstr() on it.
So what's the most efficient way to do this?
Read the file line by line and run strstr() on each line;
or read a larger chunk of the file (e.g. 10,000 lines) and run strstr() on that chunk?
Or would it be more efficient to import all this data into a MySQL database and then search for the hash via SQL queries?
Any help is appreciated.
The best way to do it would be to sort the file and then use a binary-search-like algorithm on it. After sorting, it will take around O(log n) time to find a particular entry, where n is the number of entries you have. Your algorithm might look like this:
1. Keep a start offset and an end offset. Initialize the start offset to zero and the end offset to the file size.
2. If start = end, there is no match.
3. Read some data from the offset (start + end) / 2.
4. Skip forward until you see a newline. (You may need to read more, but if you pick an appropriate size (bigger than most of your records) to read in step 3, you probably won't have to read any more.)
5. If the hash you're on is the hash you're looking for, go on to step 6. Otherwise, if the hash you're on is less than the hash you're looking for, set start to the current position and go to step 2; if it is greater than the hash you're looking for, set end to the current position and go to step 2.
6. Skip to the semicolon and trailing space. The unhashed data will run from the current position to the next newline.
This can be easily converted into a while loop with breaks.
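Here is a rough Python sketch of that search. It assumes the file has already been sorted by the hash field (plain byte order) and that the hash you pass in uses the same case and format as the file; the helper names are mine:

import os

def line_at_or_after(f, offset):
    """Return (start, line) for the first complete line beginning at or after offset."""
    if offset == 0:
        f.seek(0)
        return 0, f.readline()
    f.seek(offset - 1)
    f.readline()                              # consume up to and including the next newline
    start = f.tell()
    return start, f.readline()

def find_hash(path, wanted_hash):
    """Binary search a file of sorted "hash; string" lines; return the string or None."""
    wanted = wanted_hash.encode()
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        # invariant: if a matching line exists, it starts somewhere in [lo, hi)
        while lo < hi:
            mid = (lo + hi) // 2
            start, line = line_at_or_after(f, mid)
            if not line or start >= hi:
                hi = mid                      # no line starts in [mid, hi): look lower
                continue
            key, _, rest = line.partition(b";")
            key = key.strip()
            if key == wanted:
                return rest.strip().decode()  # found: the clear-text string
            if key < wanted:
                lo = start + len(line)        # a match, if any, starts after this line
            else:
                hi = start                    # a match, if any, starts before this line
    return None

For example, find_hash("rainbow.txt", "1f129c42de5e4f043cbd88ff6360486f") would return "somestring" once the sample data from the question has been sorted (the file name here is just an example).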
Importing it into MySQL with appropriate indices and such would use a similarly efficient algorithm (or a better one, since the data would probably be packed nicely).
Your last option might be the easiest one to implement, as you move all the performance optimization into the database (and databases are usually optimized for exactly that).
strstr() is not useful here, since it searches for a substring anywhere in a string; you know the specific format and can jump and compare in a more targeted way. Think about strncmp() and strchr().
The overhead of reading a single line at a time would be really high (as is often the case for file I/O), so I'd recommend reading a larger chunk and performing your search on that chunk. I'd even think about parallelizing the search by reading the next chunk in another thread and doing the comparison there as well.
You can also think about using memory-mapped I/O instead of the standard C file API. That way you leave loading the contents to the operating system and don't have to care about caching yourself.
Of course, restructuring the data for faster access would help you too. For example, insert padding bytes so all records are equally long. This gives you "random" access to your data stream, as you can easily calculate the position of the nth entry.
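As a tiny illustration of that last point (the 64-byte record size and the "hash; padded-string" layout are just assumed examples, not the file's real format): once every record is padded to the same length, the position of the nth entry is simply n times the record length.

RECORD_LEN = 64     # assumed: 32-char hash + "; " + string padded with spaces + "\n"

def pad_line(hash_hex, text):
    """Rewrite one "hash; string" pair as a fixed-length record."""
    record = (hash_hex + "; " + text).ljust(RECORD_LEN - 1) + "\n"
    assert len(record) == RECORD_LEN, "string too long for the chosen record size"
    return record.encode()

def nth_record(f, n):
    """Jump straight to record number n -- only possible because records are fixed-length."""
    f.seek(n * RECORD_LEN)
    return f.read(RECORD_LEN)

Combined with sorting (as in the other answer), this lets you binary-search on record numbers instead of hunting for newlines.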
I'd start by splitting the single large file into 65536 smaller files, so that if the hash begins with 0000 it's in the file 00/00data.txt, if the hash begins with 0001 it's in the file 00/01data.txt, etc. If the full file is 13 GiB then each of the smaller files would be (on average) 208 KiB.
Next, separate the hashes from the strings, so that you've got 65536 "hash files" and 65536 "string files". Each hash file would contain the remainder of the hash (the last 12 digits only, because the first 4 digits aren't needed any more) and the offset of the string in the corresponding string file. This would mean that (instead of 65536 files averaging 208 KiB each) you'd have 65536 hash files at maybe 120 KiB each and 65536 string files at maybe 100 KiB each.
Next, the hash files should be in a binary format. 12 hexadecimal digits cost 48 bits (not 12*8 = 96 bits). This alone would halve the size of the hash files. If the strings are aligned on a 4-byte boundary in the strings file, then a 16-bit "offset of the string / 4" would be fine (as long as each string file is less than 256 KiB). Entries in the hash file should be sorted, and the corresponding strings file should be in the same order.
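As a sketch of that entry format (this follows the answer's assumption of 16-hex-digit hashes and 4-byte-aligned strings; the exact packing layout and the names are my own choices):

import struct

def hash_file_name(hash_hex):
    """The first 4 hex digits select one of the 65536 hash files, e.g. "00/01data.bin"."""
    return hash_hex[:2] + "/" + hash_hex[2:4] + "data.bin"

def pack_entry(hash_hex, string_offset):
    """Pack one 8-byte hash-file entry: 48-bit hash remainder + 16-bit (offset / 4)."""
    assert len(hash_hex) == 16 and string_offset % 4 == 0 and string_offset < 256 * 1024
    remainder = int(hash_hex[4:], 16)                # the last 12 hex digits -> 48 bits
    return struct.pack(">Q", (remainder << 16) | (string_offset // 4))

def unpack_entry(entry):
    (packed,) = struct.unpack(">Q", entry)
    return packed >> 16, (packed & 0xFFFF) * 4       # (hash remainder, string offset)

Putting the remainder in the high bits and packing big-endian means a plain byte-wise sort of the 8-byte entries also sorts them by hash remainder.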
After all these changes, you'd use the highest 16 bits of the hash to find the right hash file, load the hash file, and do a binary search. Then (if found) you'd get the offset of the start of the string (in the strings file) from that entry in the hash file, plus the offset of the next string from the next entry in the hash file. Then you'd load data from the strings file, starting at the start of the correct string and ending at the start of the next one.
Finally, you'd implement a "hash file cache" in memory. If your application can allocate 1.5 GiB of RAM, that would be enough to cache half of the hash files. In that case (half the hash files cached) you'd expect that half the time the only thing you'd need to load from disk is the string itself (probably less than 20 bytes), and the other half of the time you'd need to load the hash file into the cache first (about 60 KiB); so on average each lookup would load about 30 KiB from disk. Of course more memory is better (and less is worse); and if you can allocate more than about 3 GiB of RAM you can cache all of the hash files and start thinking about caching some of the strings.
A faster way would be to have a reversible encoding, so that you can convert a string into an integer and then convert the integer back into the original string without doing any sort of lookup at all. For an example; if all your strings use lower case ASCII letters and are a max. of 13 characters long, then they could all be converted into a 64-bit integer and back (as 26^13 < 2^63). This could lead to a different approach - e.g. use a reversible encoding (with bit 64 of the integer/hash clear) where possible; and only use some sort of lookup (with bit 64 of the integer/hash set) for strings that can't be encoded in a reversible way. With a little knowledge (e.g. carefully selecting the best reversible encoding for your strings) this could slash the size of your 13 GiB file down to "small enough to fit in RAM easily" and be many orders of magnitude faster.
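As one concrete example of such a reversible encoding (my own choice of scheme, assuming strings of at most 13 lower-case ASCII letters): "bijective base-26" maps every such string to a distinct positive integer below 2^62, which leaves the top bit free to flag values that still need a real lookup.

def encode(s):
    """Map a string of 1-13 lower-case letters to a unique positive integer (< 2**62)."""
    assert 0 < len(s) <= 13 and s.isascii() and s.islower() and s.isalpha()
    n = 0
    for c in s:
        n = n * 26 + (ord(c) - ord("a") + 1)    # "bijective base-26": digits run 1..26
    return n

def decode(n):
    """Invert encode() without any table lookup."""
    chars = []
    while n > 0:
        n, d = divmod(n - 1, 26)
        chars.append(chr(ord("a") + d))
    return "".join(reversed(chars))

assert decode(encode("yetanotherstr")) == "yetanotherstr"   # a 13-character round trip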
I've been asked to process some files serialized as binary (not text/JSON unfortunately) Thrift objects, but I don't have access to the program or programmer that created the files, so I have no idea of their structure, field order, etc. Is there a way using the Thrift libraries to open a binary file and analyze it, getting a list of the field types, values, nesting, etc.?
Unfortunately it appears that Thrift's binary protocol does not do very much tagging of data at all; to decode it, you are assumed to have the .thrift file in hand so that you know, say, that the next 4 bytes are supposed to be an integer and aren't actually the first half of a float. So it appears you are stuck with, basically, looking at the files in a hex editor (or equivalent) and trying to deduce fields based on the patterns you see.
There are a very few helpful bits:
Each file begins with a version, a protocol identifier string, and a sequence number.
Maps will begin with 6 bytes that identify the key and value types (the first two bytes, as integer type codes) plus the number of elements as a 4-byte integer.
The type codes appear to be standard (the canonical location of their definitions seems to be TProtocol.h in the Thrift sources; for instance, a boolean value is specified by type code 2, a UTF-8 string by type code 16, and so on).
Strings are prefixed by a 4-byte integer length field, and lists are prefixed by the type (1 byte) and a 4-byte length.
It looks like all integer fields are saved big-endian, and floating-point values are saved in IEEE format (which should make doubles relatively easy to find, at least).
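Those details are enough to write small helpers for poking at a dump. A sketch in Python (the helper names are mine, and it only handles the pieces described above, not full Thrift decoding):

import struct

def read_string(buf, pos):
    """Strings are a big-endian 4-byte length followed by that many bytes."""
    (length,) = struct.unpack_from(">i", buf, pos)
    return buf[pos + 4:pos + 4 + length], pos + 4 + length

def read_i32(buf, pos):
    """Integer fields are stored big-endian."""
    (value,) = struct.unpack_from(">i", buf, pos)
    return value, pos + 4

def read_double(buf, pos):
    """Doubles are 8-byte big-endian IEEE values."""
    (value,) = struct.unpack_from(">d", buf, pos)
    return value, pos + 8

def read_map_header(buf, pos):
    """Maps start with the key and value type codes (1 byte each) and a 4-byte element count."""
    key_type, val_type = buf[pos], buf[pos + 1]
    (count,) = struct.unpack_from(">i", buf, pos + 2)
    return (key_type, val_type, count), pos + 6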
The TBinaryProtocol* files in Thrift have a few more helpful details; on the plus side, there are a number of different implementations so you can read the ones implemented in the language you are most comfortable with.
Sorry, I know this probably isn't that helpful, but it really does appear that this is all the information the Thrift binary format provides; clearly the format was designed with the assumption that you would always know the exact protocol spec already, and the goal was to minimize wire size rather than to make it easy to decode blindly.