Are DC coefficients included in every block of a compressed JPEG without exception?

I am reading a compressed JPEG bitstream bitwise to locate the EOB markers.
After each EOB, I would expect to find a Huffman code representing the DC coefficient bit-size.
In the vast majority of cases I have found this to be true, but occasionally there is a long (~10-bit) string of 1's followed by a similar string of 0's after an EOB. No DC Huffman code defined in the DHT of the image produces a match. Is it possible that this block has no DC coefficient? Why would this be the case? If not, is there another explanation?
Assuming I am right in thinking that all markers are byte-aligned, there are no restart markers in the image. All bytes with a value of 255 are followed by a zero once the scan has started.
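(For reference, the bit-reading I'm doing is roughly equivalent to this sketch; the class and helper names are just illustrative, not from any particular library.)

# Minimal bit reader for a JPEG entropy-coded segment (illustrative sketch).
# Assumes `data` is the scan data starting right after the SOS header.
class BitReader:
    def __init__(self, data):
        self.data = data
        self.pos = 0      # byte position within the scan data
        self.bit = 0      # next bit to read (0 = MSB) within the current byte

    def read_bit(self):
        byte = self.data[self.pos]
        bit = (byte >> (7 - self.bit)) & 1
        self.bit += 1
        if self.bit == 8:
            self.bit = 0
            self.pos += 1
            # Byte stuffing: an 0xFF data byte is always followed by 0x00,
            # which must be skipped (0xFF followed by anything else is a marker).
            if byte == 0xFF and self.pos < len(self.data) and self.data[self.pos] == 0x00:
                self.pos += 1
        return bit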

Related

What File Format Has This Magic Header?

I've got a bunch of files that from metadata I can tell are supposed to be PDFs. Some of them are indeed complete PDFs. Some of them appear to be the first part of a PDF file, though they lack the %%EOF and other footer values.
Others appear to be the last part of PDF files (they don't have any of a PDF's headers but they do have the %%EOF stuff). Curiously they start with the following 16-byte magic header:
0x50, 0x4B, 0x57, 0x41, 0x52, 0x45, 0x00, 0x00, 0x00, 0x00, 0x00, 0x57, 0x49, 0x4E, 0x33, 0x32 (PKWARE WIN32).
I'm doing a lot of inference which could possibly be misleading, but it doesn't seem to be a compression scheme (the %%EOF stuff is plaintext) and in the few files I've been allowed to look at deeply there's a correlation between starting with this magic and looking like the final segment of a PDF binary.
Does anyone have any hints as to what file format might be at play here?
Update: I've now observed this PKWARE WIN32 happening on non-PDF files as well. Speculation also suggests that these files are split up in a similar manner.
Update 2: It turns out this PKWARE WIN32 header actually occurs in repeating intervals, the location of which can be predicted by some bytes immediately prior to the header.
I've also received some circumstantial hearsay which suggests that these files are compressed and not split into multiple parts, though in 2 out of the 3 cases where I was told the output file sizes my binaries were only negligibly smaller.
The mystery continues.
Okay, so this ended up being a very strange format. Overall it's a compression scheme, but it's applied inconsistently and lightly wrapped in a way that confounded the issue.
The first 8 bytes of each of these files are its own magic, and the next 8 bytes can be read as a long giving the final size of the output file.
Then there's a 16-byte "section" header (four ints): the first is just an incremental counter, the second is the number of bytes until the next "section" break, the third is a bit of a mystery to me, and the fourth is either 0 or 1. If that flag is 0, just read the next (however many) bytes as-is; they're payload.
If it's 1, then you'll get one of these PKWARE headers next. They are honestly the part I understand least, other than that they start with the magic in the original question and are 42 bytes long in total.
If there was a PKWARE header, subtract 42 from the number of bytes to read, then treat the remaining bytes as compressed with PKWARE's "implode" algorithm, meaning you can use zlib's "explode" implementation to decompress them.
Iterate through the file taking all these headers into account and cobbling together compressed and uncompressed parts 'til you run out of bytes and you'll end up with your output file.
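To make that concrete, here's a rough Python sketch of the unwrapping loop as I understand it; pkware_explode is a placeholder for whatever implode decompressor you end up using, and the exact field sizes and the little-endian byte order are my guesses:

import struct

def unwrap(path, pkware_explode):
    """Rebuild the output file from one of these containers (rough sketch).

    `pkware_explode` is a placeholder: a callable that decompresses
    PKWARE "implode"-compressed bytes (e.g. a binding to zlib's explode code).
    """
    out = bytearray()
    with open(path, "rb") as f:
        f.read(8)                                        # per-file magic, ignored here
        (final_size,) = struct.unpack("<q", f.read(8))   # declared size of the output
        while len(out) < final_size:
            header = f.read(16)
            if len(header) < 16:
                break
            counter, span, mystery, flag = struct.unpack("<4i", header)
            payload = f.read(span)                       # bytes until the next section break
            if flag == 0:
                out += payload                           # stored as-is
            else:
                # 42-byte PKWARE header, then implode-compressed data
                out += pkware_explode(payload[42:])
    return bytes(out[:final_size])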
I have no idea why only parts of files are compressed nor why they've been broken into blocks like this but it seems to work for the limited sample data I have. Perhaps later on I'll find files that actually have been split up along those boundaries or employ some kind of fancy deduplication but at least now I can explain why it looked like I saw partial PDFs -- the files were only partially compressed.

Reverse-Engineering Memory Load Techniques?

I am attempting to reverse engineer a game (with permission). I am using IDA Pro. The functions are sub_xxxxx, meaning that they are protected functions.
However, the strings that would be the names for the functions, when looking at the only cross-reference, are shown in the following manner:
__data:xxxxxxxx DCD aEcdh_compute_k ; "ECDH_compute_key"
__data:xxxxxxxx DCB 0
__data:xxxxxxxx DCB 0x40
__data:xxxxxxxx DCB 12
__data:xxxxxxxx DCB 0x3B
Some of the numbers, including the DCBs, have been changed for the sake of safety (OCD).
I attempted to use the 40 12 3B as an offset. However, the offset brings me to the middle of a random loc_xxxxx, as do the others.
My question to you is, how would I go about finding where the actual function is? Is the offset from the top of the .data segment? Or is it from the actual declaring string itself?
I do not expect or require a full answer; obviously this may not have been encountered in the past, and I may not have given enough information. (If you need more information, please ask, thanks.) Basically, I am asking, "What should I try next?", trying to find the most likely answer. Thank you.
You're ignoring the processor's endianness, which is usually little-endian.
Hitting D twice (once to convert the data representation from single byte to word, and again to convert it from word to dword) will convert the data to a dword for you. Alternatively, you could hit O to convert the data representation directly to an offset (which is dword-sized on most architectures).
This will most likely show you an offset to address 0x003b1240, which is probably the address you were looking for.
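For example (assuming, as above, that the bytes are laid out in memory as 40 12 3B 00), reassembling them as a little-endian dword:

import struct

raw = bytes([0x40, 0x12, 0x3B, 0x00])   # bytes as laid out in memory
(addr,) = struct.unpack("<I", raw)      # "<I" = little-endian 32-bit unsigned
print(hex(addr))                        # 0x3b1240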

Bulk export of binary waveform data from oscilloscope to data points (csv preferred)

I'm working with some binary waveform files from various early-to-mid-90's HP scopes. I am trying to do a bulk conversion (we have over 5000) of the files to CSVs and then upload them into a database. I've tried hexdump, xxd, od, strings, etc., and none of them seem to work. I did hunt down a programmer's manual but it's not making a whole lot of sense.
The files have a preamble line as ASCII text, but the data points are in binary and for some reason nothing I try can decode them. The preamble gives the data necessary to interpret the binary values and calculate the correct values. It also states that the data is in WORD format.
:WAV:PRE 2,1,32768,1,+4.000000E-08,-4.9722700001108E-06,0,+2.460630E-04,+2.500000E+00,16384;:WAV:DATA #800065536^W�^W�^W�^
I'm pretty confused.
Have a look at
http://www.naic.edu/~phil/hardware/oscilloscopes/9000A_Programmer_Reference.pdf
specifically page 1-21. After ":WAV:DATA", I think the rest of the chunk above will have 65536 8-bit data bytes (the start of which is represented above by �). The ^W is probably a delimiter, so you would have to parse that out. Just a thought.
UPDATE: I'm new to oscilloscope data collection and am trying to figure the whole thing out from scratch. So, on further digging, it looks like the data you have provided shows this:
PREamble:
- WORD format (16-bit signed integers split into 2 8-bit bytes)
- If there is a WAV:BYT section, that would specify byte order for each pair
- RAW data
- 32768 data points
- COUNT = 1 (I'm not clear on the meaning of this)
- Next 3 should be X increment, origin, reference
- Next 3 should be Y increment, origin, reference, although the manual that I pointed you at above has many more fields than just these, so you might want to consult your specific scope manual.
DATA:
- On closer examination, I don't think the ^W is a delimiter; I think it is the first byte of a pair (0x17, i.e. 00010111). The � character is apparently a standard "I don't know how to represent this character" web representation. You would need to look at that character as 8 bits also.
- 65536 bytes of data (32768 two-byte pairs)
I'm not finding a utility that will do this for you. I think you're going to have to write or acquire some code (Perl, C, Java, Python, VB, etc.) to get this done.
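As a rough starting point, a Python sketch based on the interpretation above might look like the following. The big-endian word order and the (value - reference) * increment + origin scaling are assumptions on my part; check them against your scope's programmer manual and any WAV:BYT setting.

import csv
import struct
import sys

def convert(path, out_path, big_endian=True):
    with open(path, "rb") as f:
        blob = f.read()

    # ASCII preamble: ":WAV:PRE <10 comma-separated fields>;:WAV:DATA ..."
    pre_start = blob.index(b":WAV:PRE") + len(b":WAV:PRE")
    data_start = blob.index(b":WAV:DATA")
    fields = blob[pre_start:data_start].decode("ascii").strip(" ;").split(",")
    fmt, typ, points, count = (int(x) for x in fields[0:4])   # e.g. 2,1,32768,1
    xinc, xorig, xref = (float(x) for x in fields[4:7])
    yinc, yorig, yref = (float(x) for x in fields[7:10])

    # Definite-length block: '#', one digit N, then an N-digit byte count.
    i = blob.index(b"#", data_start)
    ndigits = int(blob[i + 1:i + 2])
    nbytes = int(blob[i + 2:i + 2 + ndigits])
    raw = blob[i + 2 + ndigits:i + 2 + ndigits + nbytes]

    # WORD format: 16-bit signed integers (byte order is an assumption; see WAV:BYT).
    endian = ">" if big_endian else "<"
    samples = struct.unpack("%s%dh" % (endian, nbytes // 2), raw)

    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["time", "voltage"])
        for n, value in enumerate(samples):
            t = (n - xref) * xinc + xorig        # assumed X scaling
            v = (value - yref) * yinc + yorig    # assumed Y scaling
            writer.writerow([t, v])

if __name__ == "__main__":
    convert(sys.argv[1], sys.argv[2])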

Searching through very large rainbow table file

I am looking for the best way to search through a very large rainbow table file (13GB file). It is a CSV-style file, looking something like this:
1f129c42de5e4f043cbd88ff6360486f; somestring
78f640ec8bf82c0f9264c277eb714bcf; anotherstring
4ed312643e945ec4a5a1a18a7ccd6a70; yetanotherstring
... you get the idea - there are about ~900 Million lines, always with a hash, semicolon, clear text string.
So basically, the program should check whether a specific hash is listed in this file.
What's the fastest way to do this?
Obviously, I can't read the entire file into memory and then run strstr() on it.
So what's the most efficient way to do this?
read the file line by line, always running strstr() on each line;
read a larger chunk of the file (e.g. 10,000 lines) and do a strstr() on it
Or would it be more efficient to import all this data into a MySQL database and then search for the hash via SQL queries?
Any help is appreciated
The best way to do it would be to sort it and then use a binary search-like algorithm on it. After sorting it, it will take around O(log n) time to find a particular entry where n is the number of entries you have. Your algorithm might look like this:
1. Keep a start offset and an end offset. Initialize the start offset to zero and the end offset to the file size.
2. If start = end, there is no match.
3. Read some data from the offset (start + end) / 2.
4. Skip forward until you see a newline. (You may need to read more, but if you pick an appropriate size (bigger than most of your records) to read in step 3, you probably won't have to read any more.)
5. If the hash you're on is the hash you're looking for, go on to step 6. Otherwise, if the hash you're on is less than the hash you're looking for, set start to the current position and go to step 2; if it is greater than the hash you're looking for, set end to the current position and go to step 2.
6. Skip to the semicolon and trailing space. The unhashed data will be from the current position to the next newline.
This can be easily converted into a while loop with breaks.
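As a rough illustration, here is that search as a Python sketch (assuming the file is sorted by the hash and that the hashes are fixed-length, consistently-cased hex):

def find_hash(path, target_hex):
    """Binary search a file of sorted 'hash; string' lines for one hash."""
    target = target_hex.strip().lower().encode()
    with open(path, "rb") as f:
        f.seek(0, 2)
        lo, hi = 0, f.tell()              # target line, if present, starts in [lo, hi)
        while lo < hi:
            mid = (lo + hi) // 2
            if mid == 0:
                pos = 0
                f.seek(0)
            else:
                f.seek(mid - 1)
                f.readline()              # advance to the first line starting at or after mid
                pos = f.tell()
            if pos >= hi:                 # no line starts in [mid, hi): look below mid
                hi = mid
                continue
            line = f.readline()
            key, _, rest = line.partition(b";")
            if key == target:
                return rest.strip().decode()
            if key < target:
                lo = pos + len(line)      # everything up to and including this line is too small
            else:
                hi = pos                  # everything from this line onwards is too big
    return None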
Importing it into MySQL with appropriate indices and such would use a similarly efficient algorithm (or a better one, since the data is probably packed nicely).
Your last solution might be the easiest one to implement, as you move the whole performance optimization to the database (and databases are usually optimized for exactly that).
strstr() is not useful here, as it searches for an arbitrary substring, but you know the specific format and can jump and compare in a more goal-oriented way. Think about strncmp() and strchr().
The overhead of reading a single line would be really high (as is often the case for file IO). So I'd recommend reading a larger chunk and performing your search on that chunk. I'd even think about parallelizing the search by reading the next chunk in another thread and doing the comparison there as well.
You can also think about using memory-mapped IO instead of the standard C file API. Using this, you can leave loading the whole contents to the operating system and don't have to care about caching yourself.
Of course, restructuring the data for faster access would help you too. For example, insert padding bytes so all datasets are equally long. This will give you "random" access to your data stream, as you can easily calculate the position of the nth entry.
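For example, if every record were padded to a fixed length (64 bytes is an arbitrary example) and the padded file kept sorted by hash, a memory-mapped lookup needs no line scanning at all. A rough Python sketch:

import mmap

RECORD = 64          # assumed fixed record length after padding (example value)
HASH_LEN = 32        # hex digits of the hash at the start of each record

def lookup(path, target_hex):
    target = target_hex.lower().encode()
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        lo, hi = 0, len(mm) // RECORD            # number of fixed-size records
        while lo < hi:
            mid = (lo + hi) // 2
            rec = mm[mid * RECORD:(mid + 1) * RECORD]
            key = rec[:HASH_LEN]
            if key == target:
                # hash, ";", plaintext, then padding (assumed spaces/NULs) up to RECORD bytes
                return rec[HASH_LEN + 1:].strip(b" \x00\r\n").decode()
            if key < target:
                lo = mid + 1
            else:
                hi = mid
    return None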
I'd start by splitting the single large file into 65536 smaller files, so that if the hash begins with 0000 it's in the file 00/00data.txt, if the hash begins with 0001 it's in the file 00/01data.txt, etc. If the full file was 12 GiB then each of the smaller files would be (on average) 208 KiB.
Next, separate the hash from the string; such that you've got 65536 "hash files" and 65536 "string files". Each hash file would contain the remainder of the hash (the last 12 digits only, because the first 4 digits aren't needed anymore) and the offset of the string in the corresponding string file. This would mean that (instead of 65536 files at an average of 208 KiB each) you'd have 65536 hash files at maybe 120 KiB each and 65536 string files at maybe 100 KiB each.
Next, the hash files should be in a binary format. 12 hexadecimal digits cost 48 bits (not 12*8 = 96 bits). This alone would halve the size of the hash files. If the strings are aligned on a 4-byte boundary in the strings file, then a 16-bit "offset of the string / 4" would be fine (as long as the string file is less than 256 KiB). Entries in the hash file should be sorted in order, and the corresponding strings file should be in the same order.
After all these changes, you'd use the highest 16 bits of the hash to find the right hash file, load the hash file and do a binary search. Then (if found) you'd get the offset for the start of the string (in the strings file) from the entry in the hash file, plus get the offset for the next string from the next entry in the hash file. Then you'd load data from the strings file, starting at the start of the correct string and ending at the start of the next string.
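To make the layout concrete, here is a small Python sketch of packing and searching one of those hash files (the big-endian field packing is my own choice for illustration):

import struct

ENTRY = 8   # 6 bytes of hash remainder + 2 bytes of (string offset / 4)

def pack_entry(hash_hex, string_offset):
    remainder = int(hash_hex[-12:], 16)                  # keep only the last 12 hex digits
    return remainder.to_bytes(6, "big") + struct.pack(">H", string_offset // 4)

def find_offset(hash_hex, hash_file_bytes):
    """Binary search one (sorted) in-memory hash file; returns the string offset or None."""
    key = int(hash_hex[-12:], 16).to_bytes(6, "big")
    lo, hi = 0, len(hash_file_bytes) // ENTRY
    while lo < hi:
        mid = (lo + hi) // 2
        entry = hash_file_bytes[mid * ENTRY:(mid + 1) * ENTRY]
        if entry[:6] == key:
            return struct.unpack(">H", entry[6:])[0] * 4
        if entry[:6] < key:
            lo = mid + 1
        else:
            hi = mid
    return None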
Finally, you'd implement a "hash file cache" in memory. If your application can allocate 1.5 GiB of RAM, then that'd be enough to cache half of the hash files. In this case (half the hash files cached) you'd expect that half the time the only thing you'd need to load from disk is the string itself (e.g. probably less than 20 bytes) and the other half the time you'd need to load the hash file into the cache first (e.g. 60 KiB); so on average for each lookup you'd be loading about 30 KiB from disk. Of course more memory is better (and less is worse); and if you can allocate more than about 3 GiB of RAM you can cache all of the hash files and start thinking about caching some of the strings.
A faster way would be to have a reversible encoding, so that you can convert a string into an integer and then convert the integer back into the original string without doing any sort of lookup at all. For an example; if all your strings use lower case ASCII letters and are a max. of 13 characters long, then they could all be converted into a 64-bit integer and back (as 26^13 < 2^63). This could lead to a different approach - e.g. use a reversible encoding (with bit 64 of the integer/hash clear) where possible; and only use some sort of lookup (with bit 64 of the integer/hash set) for strings that can't be encoded in a reversible way. With a little knowledge (e.g. carefully selecting the best reversible encoding for your strings) this could slash the size of your 13 GiB file down to "small enough to fit in RAM easily" and be many orders of magnitude faster.
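As a tiny illustration of such a reversible encoding (using base 27 rather than 26 so that the string length round-trips; 13 characters still fit in well under 63 bits):

def encode(s):
    """Map a lowercase a-z string (len <= 13) to a unique integer below 2^62."""
    assert len(s) <= 13 and all("a" <= c <= "z" for c in s)
    n = 0
    for c in s:
        n = n * 27 + (ord(c) - ord("a") + 1)   # digits 1..26; 0 means "no character"
    return n

def decode(n):
    chars = []
    while n:
        n, d = divmod(n, 27)
        chars.append(chr(d - 1 + ord("a")))
    return "".join(reversed(chars))

assert decode(encode("somestring")) == "somestring"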

Binary translation | Cross compilation

Say you are writing compilers for different architectures.
The architectures have different endianness.
You have memory read and write instructions
Take the example of a store instruction where you want to store the four bytes 0xAA 0xBB 0xCC 0xDD.
Now, while writing the assembly for this, do you write two different instructions for the different architectures, e.g.
For the little endian: st (reg), 0xDDCCBBAA
For the big endian: st (reg), 0xAABBCCDD
Or do you write the same instruction, say st (reg), 0xAABBCCDD, for both architectures and let the instruction be parsed by the processor such that it takes care of the endianness of the system?
The reason I ask is that I don't know what a binary translator would do when it has to translate code between architectures of different endianness. If in architecture A you see the line st (reg), XY, do you convert it into st (reg), YX for architecture B? If that is the case, then what happens to memory reads?
I would like to know how to take care of endianness, considering memory reads and writes in binary translation.
Endianness has nothing to do with how memory is read or written; it just determines whether, when memory is interpreted as a number, the most significant byte comes first or last. It is only the implementation of the arithmetic which makes the difference.
So your binary translator, if such a thing even exists, won't change anything; it is just instructions like ADD, SUB and MUL which interpret numbers differently.
I'm not sure I understand your question fully, but it sounds like you want to translate some assembly-language code or a disassembled binary?
Every assembler I've ever worked with handles the endianness of constants in the sane way. That is to say, if you want to store 0xAABBCCDD, you would write:
st (reg), 0xAABBCCDD
And the assembler will swizzle the constant if necessary for the appropriate opcode. Where endianness becomes a concern is where you want to store multiple single-byte values using that one operation. Something like writing a short null-terminated string "123" to memory using the same opcode. You have to swizzle that constant in your assembly code to get it output to memory in the right order for little- vs. big-endian systems:
st (reg), 0x31323300 // big-endian
st (reg), 0x00333231 // little-endian
The safe way is to just store the bytes in the order you want them:
stb (reg+0), 0x31
stb (reg+1), 0x32
stb (reg+2), 0x33
stb (reg+3), 0x00
But that takes four instructions, instead.
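You can see the swizzling at work with a quick check (Python's struct is used here just to demonstrate the byte orderings; the constants are the ones from the example above):

import struct

big    = struct.pack(">I", 0x31323300)   # constant as written for a big-endian store
little = struct.pack("<I", 0x00333231)   # constant as written for a little-endian store

assert big == little == b"123\x00"       # both end up as "123\0" in memory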