error-correcting code checksum - binary

Question: "Adding all bytes together gives 118h. Drop the Carry Nibble to give you 18h." I don't understand this term "Carry Nibble".
If I compute the checksum for the single byte 10010101 (95h), is the checksum 04h?
source : http://www.asic-world.com/digital/numbering4.html#Error_Detecting_and_Correction_Codes
"
The parity method is calculated over a byte, word or double word. But when errors need to be checked over 128 bytes or more (basically blocks of data), calculating parity is not the right way. So we have checksums, which allow checking for errors on a block of data. There are many variations of checksum:
Adding all bytes
CRC
Fletcher's checksum
Adler-32
The simplest form of checksum, which simply adds up the asserted bits in the data, cannot detect a number of types of errors. In particular, such a checksum is not changed by:
Reordering of the bytes in the message
Inserting or deleting zero-valued bytes
Multiple errors which sum to zero
Example of Checksum : Given 4 bytes of data (can be done with any number of bytes): 25h, 62h, 3Fh, 52h
Adding all bytes together gives 118h.
Drop the Carry Nibble to give you 18h.
Get the two's complement of the 18h to get E8h. This is the checksum byte.
To Test the Checksum byte simply add it to the original group of bytes. This should give you 200h.
Drop the carry nibble again, giving 00h. Since it is 00h, this means the bytes were probably not changed."
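For what it's worth, here is the quoted 4-byte example worked through in a short Python sketch (my own illustration, not from the linked page). "Drop the carry nibble" just means discarding everything above the low 8 bits of the sum, i.e. keeping sum & 0xFF:

data = [0x25, 0x62, 0x3F, 0x52]

total = sum(data)                 # 0x118
low_byte = total & 0xFF           # 0x18 -> the carry nibble (the leading 1) is dropped
checksum = (-low_byte) & 0xFF     # 0xE8 -> two's complement of 0x18

# Verification: add the checksum to the original bytes and drop the carry again.
assert (total + checksum) & 0xFF == 0x00

# The single byte from the question: for 10010101 (0x95) the same procedure
# gives (-0x95) & 0xFF == 0x6B, not 0x04.
assert (-0x95) & 0xFF == 0x6B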

Related

Reverse Engineering CRC32 in firmware

I have a P-flash (about 700 KB in size), and in this flash there is a CRC32. I know where it is, and I know the CRC calculation method (polynomial, initial value, final XOR value, input and output reflected). The problem is that only a part of these 700 KB is used to calculate the CRC, and I don't know which part. Is there a way to find out the input data for the calculation?
I have 5 of these 700 KB files. The files are all identical except for 4 data bytes that differ and the 4 bytes of the CRC.
If you can get the files onto a PC, that would help. You can XOR any two of the files to get a file that is all zeroes except for the 4 differing bytes and the 4 bytes of the CRC. The XOR of two files also eliminates any initial value or final XOR value, as if initial value = 0 and final XOR value = 0. Then check the nearly-all-zero file to see if the CRC matches what you would expect. If it matches, you would know that the CRC includes the 4 non-zero bytes and all the zero bytes that follow them, but you wouldn't know how far before the 4 non-zero bytes the CRC calculation starts. Still, it would be a start, and it would reduce the amount of searching for what is included in the CRC calculation.
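A rough Python sketch of that XOR check (the file names, the CRC location, its byte order and the covered region are all placeholders, and it assumes the reflected CRC-32 variant; the bit-by-bit CRC is slow but keeps the "initial value = 0, final XOR = 0" idea explicit):

def crc32_raw(data, poly=0xEDB88320, init=0):
    # Reflected CRC-32, processed bit by bit, with an explicit initial value
    # and no final XOR.
    crc = init
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (poly if crc & 1 else 0)
    return crc

a = open("dump1.bin", "rb").read()        # hypothetical file names
b = open("dump2.bin", "rb").read()
x = bytes(p ^ q for p, q in zip(a, b))    # identical regions cancel to zero

CRC_OFFSET = 0x12340                      # hypothetical: where the 4 CRC bytes sit
stored_xor = int.from_bytes(x[CRC_OFFSET:CRC_OFFSET + 4], "little")  # byte order is a guess

# Guess: the CRC covers everything before the CRC field. If that guess is right,
# the raw CRC of the XORed image equals the XOR of the two stored CRCs.
print(hex(crc32_raw(x[:CRC_OFFSET])), hex(stored_xor))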
Assuming the part used for the CRC is contiguous, you could do a brute-force search using a fast CRC32. On x86 with SSE2 (xmm) registers, an assembly-based CRC32 can compute a CRC32 over 700,000 bytes in about 0.0002 seconds on an Intel 3770K at 3.5 GHz (a 3rd-generation processor; current ones are faster), so a bit more than 70 seconds to try lengths from 8 to 700,000 bytes.
I converted the code from the GitHub example below to Visual Studio asm, for both reflected and non-reflected CRCs, using the CRC32 and CRC32C polynomials, and I can upload the code if you are interested.
https://github.com/intel/isa-l/blob/master/crc/crc16_t10dif_01.asm
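And a plain-Python sketch of the brute-force length search itself (much slower than the SSE2 assembly mentioned above, but it shows the idea; it assumes the covered region starts at offset 0 and ends somewhere before the stored CRC, and it uses zlib's standard reflected CRC-32, so the parameters may need adapting):

import zlib

image = open("dump1.bin", "rb").read()            # hypothetical file name
CRC_OFFSET = 0x12340                              # hypothetical: where the CRC is stored
stored = int.from_bytes(image[CRC_OFFSET:CRC_OFFSET + 4], "little")  # byte order is a guess

# Grow the candidate region one byte at a time; zlib.crc32 can be updated
# incrementally, so each extra length costs one small update rather than a
# full pass over 700 KB.
crc = 0
for end in range(CRC_OFFSET):
    crc = zlib.crc32(image[end:end + 1], crc)
    if end + 1 >= 8 and crc == stored:
        print("CRC-32 matches the region 0 ..", hex(end + 1))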

Recommended way to store a string in this case?

I am storing strings and 99.5+% are less than 255 characters, so I store them in a VARCHAR(255).
The thing is, some of them can be 4kb or so. What's the best way to store those?
Option #1: store them in another table with a pointer to the main.
Option #1.0: add an INT column with DEFAULT NULL and the pointer will be stored there
Option #1.1: the pointer will be stored in the VARCHAR(255) column, e.g. 'AAAAAAAAAAA[NUMBER]AAAAAAAAAAAA'
Option #2: increase the size of VARCHAR from 255 to 32767
What's the best of the above, Option #1.0, Option #1.1 or Option #2, performance-wise?
Increase the size of your field to fit the max size of your string. A VARCHAR will not use the space unless needed.
VARCHAR values are stored as a 1-byte or 2-byte length prefix plus
data. The length prefix indicates the number of bytes in the value. A
column uses one length byte if values require no more than 255 bytes,
two length bytes if values may require more than 255 bytes.
http://dev.mysql.com/doc/refman/5.0/en/char.html
The MySQL documentation says that VARCHAR(N) takes up to L + 1 bytes if column values require 0 to 255 bytes, and L + 2 bytes if values may require more than 255 bytes, where L is the length in bytes of the stored string. For example, a 10-byte string costs 11 bytes in a VARCHAR(255) column and 12 bytes in a VARCHAR(32767) column.
So option #2 is quite okay, because the small strings will still take far less space than 32767 bytes.
EDIT:
Also, imagine the countless problems options #1.0 and #1.1 would raise when you actually want to query a string without knowing in advance whether it exceeds 255 characters or not.
Option #2 is clearly best. It just adds 1 byte to the size of each value, and doesn't require any complicated joins to merge in the fields from the second table.

Correct way to store a bit array

I'm working on a project that needs to store something like
101110101010100011010101001
into the database. It's not a file or archive: it's only a bit array, and I think that storing it in a VARCHAR column is a waste of space/performance.
I've looked at the BLOB and VARBINARY types, but both of them allow inserting a value like 54563423523515453453, which is not exactly a bit array.
For sure, if I store a bit array like 10001000 as text in a BLOB/VARBINARY/VARCHAR column, it will consume more than one byte, and I want the minimum space to be consumed: eight bits should take only one byte, 16 bits two bytes, and so on.
If that's not possible, what is the best approach to waste the minimum amount of space in this case?
Important note: the size of the array is variable and is not always divisible by eight. Sometimes I will need to store 325 bits, other times 7143 bits...
In one of my previous projects I converted streams of 1's and 0's to decimal, but they were shorter. I don't know if that would be applicable to your project.
On the other hand, IMHO, you should clarify what you will need to do with that data once it is stored. Search? Compare? It may largely depend on the purpose of the database.
Could you gzip it and then store it? Is that applicable?
Binary is a string representation of a number. The string
101110101010100011010101001
represents the number
... + 1*2^5 + 0*2^4 + 1*2^3 + 0*2^2 + 0*2^1 + 1*2^0
As such, it can be stored in a 32-bit integer if it is converted from a binary string to the number it represents. In Perl, one would use
oct('0b'.$binary)
But you have a variable number of bits. Not a problem! Just process them 8 at a time to create a string of bytes to place in a BLOB or similar.
Ah, but there's a catch. You'll need to add padding to get a number of bits divisible by 8, which means you'll also need a way of removing that padding. A simple approach, if there's a known maximum length, is to use a length prefix. E.g. if you know the number of bits will never exceed 65,535, encode the number of bits in the first two bytes of the string.
pack('nB*', length($binary), $binary)
which is reverted using
my ($length, $binary) = unpack('nB*', $packed);
substr($binary, $length) = '';
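Roughly the same length-prefixed scheme sketched in Python, in case that is more familiar (a 2-byte big-endian bit count followed by the padded bits; the function names are my own):

def pack_bits(bits):
    # "bits" is a string of '0'/'1' characters, at most 65,535 of them.
    if len(bits) > 0xFFFF:
        raise ValueError("needs a wider length prefix")
    padded = bits + "0" * (-len(bits) % 8)        # pad to a whole number of bytes
    body = int(padded, 2).to_bytes(len(padded) // 8, "big") if padded else b""
    return len(bits).to_bytes(2, "big") + body

def unpack_bits(blob):
    n = int.from_bytes(blob[:2], "big")
    bits = bin(int.from_bytes(blob[2:], "big"))[2:].zfill((len(blob) - 2) * 8)
    return bits[:n]

packed = pack_bits("101110101010100011010101001")   # 27 bits -> 2 + 4 = 6 bytes
assert unpack_bits(packed) == "101110101010100011010101001"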

Searching through very large rainbow table file

I am looking for the best way to search through a very large rainbow table file (13GB file). It is a CSV-style file, looking something like this:
1f129c42de5e4f043cbd88ff6360486f; somestring
78f640ec8bf82c0f9264c277eb714bcf; anotherstring
4ed312643e945ec4a5a1a18a7ccd6a70; yetanotherstring
... you get the idea - there are about 900 million lines, always a hash, a semicolon, and the clear-text string.
So basically, the program should check whether a specific hash is listed in this file.
What's the fastest way to do this?
Obviously, I can't read the entire file into memory and then run strstr() on it.
So what's the most efficient way to do this?
read the file line by line and call strstr() on each line;
read a larger chunk of the file (e.g. 10,000 lines) and do a strstr() on that chunk
Or would it be more efficient to import all this data into a MySQL database and then search for the hash via SQL queries?
Any help is appreciated.
The best way to do it would be to sort the file and then use a binary-search-like algorithm on it. After sorting, it will take around O(log n) time to find a particular entry, where n is the number of entries. Your algorithm might look like this:
1. Keep a start offset and an end offset. Initialize the start offset to zero and the end offset to the file size.
2. If start = end, there is no match.
3. Read some data from the offset (start + end) / 2.
4. Skip forward until you see a newline. (You may need to read more, but if you pick an appropriate size (bigger than most of your records) to read in step 3, you probably won't have to read any more.)
5. If the hash you're on is the hash you're looking for, go on to step 8.
6. Otherwise, if the hash you're on is less than the hash you're looking for, set start to the current position and go to step 2.
7. If the hash you're on is greater than the hash you're looking for, set end to the current position and go to step 2.
8. Skip to the semicolon and trailing space. The unhashed data runs from the current position to the next newline.
This can be easily converted into a while loop with breaks.
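For example, a condensed Python version of those steps (a sketch only: it assumes the file has already been sorted by hash, e.g. with an external sort, and the file name is a placeholder):

def find_hash(path, target):
    # target is the hash as bytes; lines look like b"<hash>; <cleartext>\n".
    with open(path, "rb") as f:
        # The seek-and-skip trick below can never land cleanly on the very
        # first line, so check it up front.
        line = f.readline()
        if line.startswith(target + b";"):
            return line.split(b";", 1)[1].strip().decode()
        f.seek(0, 2)
        lo, hi = 0, f.tell()                   # step 1: start and end offsets
        while lo < hi:                         # step 2
            mid = (lo + hi) // 2
            f.seek(mid)                        # step 3
            f.readline()                       # step 4: skip the partial line
            start = f.tell()
            line = f.readline()
            if not line:                       # ran off the end of the file
                hi = mid
                continue
            h, _, clear = line.partition(b";")
            if h == target:                    # step 5
                return clear.strip().decode()  # step 8: the clear text
            elif h < target:                   # step 6
                lo = start
            else:                              # step 7
                hi = mid
    return None

# e.g. find_hash("rainbow.csv", b"4ed312643e945ec4a5a1a18a7ccd6a70")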
Importing it into MySQL with appropriate indexes would use a similarly efficient algorithm (or better, since the data would probably be packed nicely).
Your last solution might be the easiest one to implement, as you move all of the performance optimization into the database (and databases are usually optimized for exactly that).
strstr is not useful here, because it searches a whole string, but you know the specific format and can jump and compare in a more targeted way. Think about strncmp and strchr.
The overhead of reading a single line at a time would be really high (as is often the case with file I/O). So I'd recommend reading a larger chunk and performing your search on that chunk. I'd even think about parallelizing the search by reading the next chunk in another thread and doing the comparison there as well.
You could also think about using memory-mapped I/O instead of the standard C file API. That way you leave loading the contents to the operating system and don't have to care about caching yourself.
Of course, restructuring the data for faster access would help too. For example, insert padding bytes so all records are equally long. That gives you "random" access to your data stream, because you can easily calculate the position of the nth entry.
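If you stay with a sequential scan, memory-mapped I/O is only a few lines in Python (a sketch; the file name and hash are placeholders, and a 13 GB mapping needs a 64-bit process):

import mmap

def lookup(path, target):
    # Scan the mapping for "<hash>;" at the start of a line and return the
    # clear text that follows; the OS takes care of paging and caching.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(target + b";")
            while pos != -1:
                if pos == 0 or mm[pos - 1:pos] == b"\n":    # really a line start?
                    end = mm.find(b"\n", pos)
                    if end == -1:
                        end = len(mm)
                    return mm[pos + len(target) + 1:end].strip().decode()
                pos = mm.find(target + b";", pos + 1)
    return None

# e.g. lookup("rainbow.csv", b"78f640ec8bf82c0f9264c277eb714bcf")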
I'd start by splitting the single large file into 65536 smaller files, so that if the hash begins with 0000 it's in the file 00/00data.txt, if the hash begins with 0001 it's in the file 00/01data.txt, etc. If the full file was 12 GiB then each of the smaller files would be (on average) 208 KiB.
Next, separate the hash from the string; such that you've got 65536 "hash files" and 65536 "string files". Each hash file would contain the remainder of the hash (the last 12 digits only, because the first 4 digits aren't needed anymore) and the offset of the string in the corresponding string file. This would mean that (instead of 65536 files at an average of 208 KiB each) you'd have 65536 hash files at maybe 120 KiB each and 65536 string files at maybe 100 KiB each.
Next, the hash files should be in a binary format: 12 hexadecimal digits cost 48 bits (not the 12*8 = 96 bits needed to store them as characters). This alone would halve the size of the hash files. If the strings are aligned on a 4-byte boundary in the strings file, then a 16-bit "offset of the string / 4" would be fine (as long as each string file is less than 256 KiB). Entries in a hash file should be sorted in order, and the corresponding strings file should be in the same order.
After all these changes, you'd use the highest 16 bits of the hash to find the right hash file, load the hash file and do a binary search. Then (if found) you'd get the offset of the start of the string (in the strings file) from the entry in the hash file, plus the offset of the next string from the next entry in the hash file. Then you'd load data from the strings file, starting at the start of the correct string and ending at the start of the next string.
Finally, you'd implement a "hash file cache" in memory. If your application can allocate 1.5 GiB of RAM, then that'd be enough to cache half of the hash files. In this case (half the hash files cached) you'd expect that half the time the only thing you'd need to load from disk is the string itself (e.g. probably less than 20 bytes) and the other half the time you'd need to load the hash file into the cache first (e.g. 60 KiB); so on average for each lookup you'd be loading about 30 KiB from disk. Of course more memory is better (and less is worse); and if you can allocate more than about 3 GiB of RAM you can cache all of the hash files and start thinking about caching some of the strings.
A faster way would be to have a reversible encoding, so that you can convert a string into an integer and then convert the integer back into the original string without doing any sort of lookup at all. For an example; if all your strings use lower case ASCII letters and are a max. of 13 characters long, then they could all be converted into a 64-bit integer and back (as 26^13 < 2^63). This could lead to a different approach - e.g. use a reversible encoding (with bit 64 of the integer/hash clear) where possible; and only use some sort of lookup (with bit 64 of the integer/hash set) for strings that can't be encoded in a reversible way. With a little knowledge (e.g. carefully selecting the best reversible encoding for your strings) this could slash the size of your 13 GiB file down to "small enough to fit in RAM easily" and be many orders of magnitude faster.
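A toy version of that reversible encoding (my own sketch: lower-case a-z only, at most 13 characters, encoded in base 27 so that strings of different lengths cannot collide; 27^13 still fits below 2^63, leaving the top bit free as the "needs a real lookup" flag):

def encode(s):
    # Map 1..13 lower-case letters to an integer below 2**63.
    assert 1 <= len(s) <= 13 and s.isalpha() and s.islower()
    n = 0
    for ch in s:
        n = n * 27 + (ord(ch) - ord("a") + 1)   # digit 0 is reserved for "no character"
    return n

def decode(n):
    out = []
    while n:
        n, d = divmod(n, 27)
        out.append(chr(d - 1 + ord("a")))
    return "".join(reversed(out))

assert decode(encode("yetanotherstr")) == "yetanotherstr"
assert encode("yetanotherstr") < 2 ** 63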

Confused with endianness: bits or bytes?

I extracted this from a tutorial:
Little-Endian order is the one we will be using in this document, and unless stated specifically you should assume that Little-Endian order is used in any file. The alternative is Big-Endian ordering. So let's see an example. Take the following stream of 8 bits: 10001110. If you have been following the document so far, you would quickly calculate the value of this 8-bit number as being 1x2^7 + 0x2^6 + … + 1x2^1 + 0x2^0 = 142. This is an example of Little-Endian ordering. However, in Big-Endian ordering we need to read the number in the opposite direction: 1x2^0 + 0x2^1 + … + 1x2^6 + 0x2^7 = 113.
Is this correct?
I used to think that endianness has to do with the order in which the BYTES (not the bits) are read.
Yes, in the context of memory/storage, endianness indeed refers to byte ordering (typically). What would it mean to say that e.g. the least-significant bit "comes first"?
Bit endianness is relevant in some situations, for instance when sending data over a serial bus.
You are correct - that quote you have there is rubbish, IMHO.
It wouldn't make sense to reorder bits, and it would be pretty confusing to boot. CPUs don't read single bits; they read bytes, or combinations of bytes, at a time, so that's the ordering that matters.
When they store a number made up of multiple bytes, they can either store it from left to right, making the high-order byte lowest in memory, or right to left, with the low-order byte lowest in memory.
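A quick way to see that it is the bytes that move, not the bits, sketched with Python's struct module:

import struct

n = 0x12345678
print(struct.pack("<I", n).hex())   # little-endian: '78563412' (low-order byte first)
print(struct.pack(">I", n).hex())   # big-endian:    '12345678' (high-order byte first)
# The bits inside each byte keep their meaning; only the byte order differs.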