Permutation invariant binary codes

I'm currently working on a problem where I have objects trivially numbered using a binary code, i.e. object 0 = 000, object 1 = 001, object 2 = 010, and so on. However, in my case I sometimes don't know where the start of the sequence is (i.e. 001 could appear as 010 or 100). Is there some optimal way to number my objects so that the labels are invariant to this kind of permutation? And what are the search terms I need to read up more about this myself?
An obvious solution would be to number them by the number of set ("hot") bits (i.e. 0 = 0000, 1 = 0001, 2 = 0011, 3 = 0111, ...), but this is a very inefficient code.

Based on the updated comments, a prefix-free code is sufficient. It allows you to uniquely determine each symbol's length, which solves the problem of not knowing where the next symbol starts.
Note that all fixed-length codes are trivially prefix-free. Just consider 8-bit ASCII: every 8 bits a new character starts, and each byte in the stream uniquely represents one ASCII character. The unique prefix length is exactly the fixed length.
There's also a trivial conversion that works for any variable-length encoding, even one that isn't prefix-free. Simply replace the start of each code with "11", each 0 bit with "01", each 1 bit with "10", and the end of each code with "00". Now "00" becomes "11 01 01 00" and "000" becomes "11 01 01 01 00". To parse this, you look for "11...00" to find the symbol boundaries, and then simply decode the bits in between. Depending on the exact stream/message layer, you might even be able to skip the "00".
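For illustration, here is a minimal sketch of that framing scheme (Python used only for demonstration; the function names are mine, not from any library): each raw bit string is wrapped in the 11 ... 00 markers, and the decoder recovers the symbol boundaries by walking the stream two bits at a time.

def frame_encode(bits):
    # Wrap a raw bit string (e.g. "010") in the 11 ... 00 framing.
    body = "".join("01" if b == "0" else "10" for b in bits)
    return "11" + body + "00"

def frame_decode(stream):
    # Split a concatenated stream of framed codes back into the original bit strings.
    symbols, current = [], None
    for i in range(0, len(stream), 2):
        pair = stream[i:i + 2]
        if pair == "11":                 # start marker
            current = []
        elif pair == "00":               # end marker
            symbols.append("".join(current))
            current = None
        else:                            # "01" -> 0, "10" -> 1
            current.append("0" if pair == "01" else "1")
    return symbols

stream = frame_encode("00") + frame_encode("000")
print(stream)                # 110101001101010100
print(frame_decode(stream))  # ['00', '000']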

Related

Need some assistance understanding binary addition/subtraction using 2's complement

If A = 01110011, B = 10010100, how would I add these?
I did this:
i.e.: 01110011 + 10010100 = 100000111
Though isn't it essentially 115 + (-108) = 7, whereas I'm getting -249?
Edit: I see that removing the highest order bit (overflow) I get 7 which is what I'm looking for but I'm not getting why you wouldn't have the extra bit.
Edit 2: OK, I figured it out. There was no overflow as I had assumed, because 7 is within [-128, 127] (8 bits). Instead, like Omar hinted at, I was supposed to drop the "extra" 1 from the addition.
Your calculation is correct and the result is correct.
You stated that the second number is -108, so both your numbers are interpreted as signed 8-bit values. Thus, you should also interpret your result as an 8-bit signed value; this is why the 9th bit must be dropped, and so the result is 7 (00000111).
On real hardware, like an 8-bit CPU for example, all the registers are 8 bits wide, so you are only able to store the lowest 8 bits of the result, which here is 7 (00000111).
In some cases, the 9th bit may also be put into a carry flag, so it's not completely "dropped".
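As a quick illustration (a Python sketch, not tied to any particular CPU), the same masking can be done in code: keep only the low 8 bits of the sum and then reinterpret them as a signed value.

def to_signed8(x):
    # Interpret the low 8 bits of x as a two's complement value.
    x &= 0xFF
    return x - 256 if x >= 128 else x

a = 0b01110011               # 115
b = 0b10010100               # 148 unsigned, -108 as a signed 8-bit value
raw = a + b                  # 263 = 0b100000111, nine bits
result = raw & 0xFF          # drop the 9th bit, as an 8-bit register would
print(bin(raw), bin(result), to_signed8(result))  # 0b100000111 0b111 7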

Why hexadecimal numbers are prefixed with "0*"

Instead of writing ffff, why is the syntax for writing hexadecimal numbers like 0*ffff? What is the meaning of "0*"? Does it specify something?
Anyhow, the A, B, C, D, E, F notation appears only in the hexadecimal number system, so what's the need for "0*"?
Sorry, "*" was not the character I meant; it is "x".
Is it a nomenclature or notation for hexadecimal number systems?
I don't know what language you are talking about, but if, for example, in C# you write
var ffffff = "Some unrelated string";
...
var nowYouveDoneIt = ffffff;
what do you expect to happen? How does the compiler know if ffffff refers to the hexadecimal representation of the decimal number 16777215 or to the string variable defined earlier?
Since identifiers (in C#) can't begin with a number, prefixing with a 0 and some other character (in C# it's 0xffffff for hex and 0b111111111111111111111111 for binary, IIRC) is a handy way of communicating what base the number literal is in.
EDIT: Another issue, if you were to write var myCoolNumber = 10, how would you have ANY way of knowing if this means 2, 10 or 16? Or something else entirely.
It's typically 0xFFFF: the letter, not the multiplication symbol.
As for why, 0x is just the most common convention, like how some programming languages allow binary to be prefixed by 0b. Prefixing a number with just 0 is typically reserved for octal, or base 8; they wanted a way to tell the machine that the following number is in hexadecimal, or base 16 (10 != 0b10 [2] != 010 [8] != 0x10 [16]). The small 'o' that would identify octal was typically omitted for human-readability purposes.
Interestingly enough, most Assembly-based implementations I've come across use (or at least allow the use of) 0h instead or as well.
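To make the difference concrete, here's a small check (Python used purely for illustration): the same digits "10" give four different values depending on which base the prefix selects.

print(int("10", 2))    # 2   (binary, what 0b10 means)
print(int("10", 8))    # 8   (octal, what a leading 0 means in C)
print(int("10", 10))   # 10  (decimal)
print(int("10", 16))   # 16  (hexadecimal, what 0x10 means)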
It's there to mark the number as heX. It's not '*', it's 'x' actually.
See:
http://www.tutorialspoint.com/cprogramming/c_constants.htm

Why is JSON invalid if an integer begins with a leading zero?

I'm importing some JSON files into my Parse.com project, and I keep getting the error "invalid key:value pair".
It states that there is an unexpected "8".
Here's an example of my JSON:
{
"Manufacturer":"Manufacturer",
"Model":"THIS IS A STRING",
"Description":"",
"ItemNumber":"Number12345",
"UPC":083456789012,
"Cost":"$0.00",
"DealerPrice":" $0.00 ",
"MSRP":" $0.00 "
}
If I update the JSON by either removing the 0 from "UPC":083456789012, or converting it to "UPC":"083456789012", it becomes valid.
Can JSON really not accept an integer that begins with 0, or is there a way around the problem?
A leading 0 indicates an octal number in JavaScript. An octal number cannot contain an 8; therefore, that number is invalid.
Moreover, JSON doesn't (officially) support octal numbers, so formally the JSON is invalid even if the number did not contain an 8. Some parsers do support it, though, which may lead to some confusion. Other parsers will recognize it as an invalid sequence and will throw an error, although the exact explanation they give may differ.
Solution: If you have a number, don't ever store it with leading zeroes. If you have a value that needs to have a leading zero, don't treat it as a number, but as a string. Store it with quotes around it.
In this case, you've got a UPC which needs to be 12 digits long and may contain leading zeroes. I think the best way to store it is as a string.
It is debatable, though. If you treat it as a barcode, seeing the leading 0 as an integral part of it, then string makes sense. Other types of barcodes can even contain alphabetic characters.
On the other hand, a UPC is a number, and the fact that it's left-padded with zeroes to 12 digits could be seen as a display property. Actually, if you left-pad it to 13 digits by adding an extra 0, you've got an EAN code, because EAN is a superset of UPC.
If you have a monetary amount, you might display it as € 7.30, while you store it as 7.3, so it could also make sense to store a product code as a number.
But that decision is up to you. I can only advise you to use a string, which is my personal preference for these codes, and if you choose a number, then you'll have to remove the 0 to make it work.
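If you do go the number route, the "display property" view above can be handled at output time. A minimal sketch (Python used only for illustration, with the UPC value from the question):

upc = 83456789012              # stored as a plain number, leading zero lost
print(str(upc).zfill(12))      # 083456789012  - padded back to 12 digits for display
print(str(upc).zfill(13))      # 0083456789012 - the 13-digit EAN form mentioned above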
One of the more confusing parts of JavaScript is that if a number starts with a 0 that isn't immediately followed by a ., it represents an octal, not a decimal.
JSON borrows from JavaScript syntax but avoids its confusing features, so it simply bans numbers with leading zeros (unless they are followed by a .) outright.
Even if this wasn't the case, there would be no reason to expect the 0 to still be in the number when it was parsed, since 02 and 2 are just different representations of the same number (if you force decimal).
If the leading zero is important to your data, then you probably have a string and not a number.
"UPC":"083456789012"
A product code is an identifier, not something you do maths with. It should be a string.
Formally, it is because JSON uses DecimalIntegerLiteral in its JSONNumber production:
JSONNumber ::
-_opt DecimalIntegerLiteral JSONFraction_opt ExponentPart_opt
And DecimalIntegerLiteral may only start with 0 if it is 0:
DecimalIntegerLiteral ::
0
NonZeroDigit DecimalDigits_opt
The rationale behind this is probably:
In the JSON Grammar - to reuse constructs from the main ECMAScript grammar.
In the main ECMAScript grammar - to make it easier to distinguish DecimalIntegerLiteral from HexIntegerLiteral and OctalIntegerLiteral (OctalIntegerLiteral in the first place).
See these productions:
HexIntegerLiteral ::
0x HexDigit
0X HexDigit
HexIntegerLiteral HexDigit
...
OctalIntegerLiteral ::
0 OctalDigit
OctalIntegerLiteral OctalDigit
The UPC should be in string format. In the future you may also get other types of UPC, such as GS1-128, or string-based product identification codes. Set your DB column to be a string.
If an integer starts with 0 in JavaScript, it is considered to be the octal (base 8) value of the integer instead of the decimal (base 10) value. For example:
var a = 065; //Octal Value
var b = 53; //Decimal Value
a == b; //true
I think the easiest way to send your number via JSON is to send it as a string.
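You can see both behaviours with any strict JSON parser; here's a quick check using Python's json module for illustration:

import json

try:
    json.loads('{"UPC": 083456789012}')
except json.JSONDecodeError as e:
    print("rejected:", e)                      # the leading zero is not valid JSON

print(json.loads('{"UPC": "083456789012"}'))   # {'UPC': '083456789012'} - the quoted form parses fine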

In-memory layout of array in Turbo Pascal

We have an old application in Turbo Pascal which can save its internal state into a file, and we need to be able to read/write this file in a C# application.
The old application generates the file by dumping various in-memory data structures. In one place, the application just dumps a range of memory, and this memory range contains some arrays. I am trying to noodle out the purpose of the bytes immediately preceding the actual array elements. In particular, the first two items in the block can be represented as:
type
string2 = string[2];
stringarr2 = array[0..64] of string2;
string4 = string[4];
stringarr4 = array[0..64] of string4;
In the data file, I see the following byte sequence:
25 00 02 02 41 42 02 43 44 ...
The 25 is the number of elements in the array. The 02 41 42 is the first string element, "AB"; the 02 43 44 is the second string element, "CD", and so on. I don't know what the 00 02 between the array element count and the first array element refers to. It's possible the array element count is 25 00 and the element size is 02, but each array element is actually 3 bytes in size.
In the place in the file where the array of 4-character strings starts, I see the following:
25 00 04 00 00 04 41 42 43 44 04 45 46 47 48
Again, there's the 25 which is the number of elements in the array; 04 41 42 43 44 is the first element in the array, "ABCD", and so on. In between there are the bytes 00 04 00 00. Maybe they are flags. Maybe they are some kind of indication of the shape of the array (but I don't see how 02 and 04 both indicate a one-dimensional array).
I don't have access to Turbo Pascal to try writing different kinds of arrays to a file, and don't have authorization to install something like Free Pascal, so my opportunities for experimentation along those lines are very limited.
These arrays are not dynamic, since Turbo Pascal didn't have them.
Thanks in advance for any dusty memories.
Pascal arrays have no bookkeeping data. You have an array of five-byte data structures (string[4]), so an array of 65 of them occupies 65*5=325 bytes. If the program wrote more than that, then it's because the program took special measures to write more. The "extra" values weren't just sitting in memory that the program happened to write to disk when it naively wrote the whole data structure with SizeOf. Thus, the only way to know what those bytes mean is to find the source code or the documentation. Merely knowing that it's Turbo Pascal is no help.
It's possible that the first section of the file is intentionally the same size as all the other array elements. For the two-character strings, the "header" is three bytes, and for the four-character strings, the "header" is five bytes, the same as the size of the strings. That would have allowed the program to declare the file as a file of string4, and then just skip the file's first record. The zero between the file length and the string length in the header might belong to either of those fields, and the remaining two zero bytes might just be filler.
Besides the layout of the individual strings of characters in the file, you will also need to consider what code page those single-byte characters are from. C# chars are Unicode 2-byte chars.
If you're lucky, the original file data contains only 7-bit ASCII characters, which covers the characters of the English alphabet. If the original data contains "European" letters such as umlauts or accented characters, these will be "high ASCII" char values in the range 128..255. You'll need to perform an encoding conversion to see these characters correctly in C#. Code page 1252 (Windows Latin 1) would be a good starting point.
If the original file data contains Japanese, Chinese, Korean, Thai, or characters from other "Eastern" scripts, you have a lot of work ahead of you.
Turbo Pascal strings are prefixed with a length byte. So a string[2] is actually 3 bytes: length, char1 and char2. An array of string[2] will hold all the strings one by one directly after each other in memory. If you do a blockwrite with the array as a parameter it will immediately start with the first string, it will not write any headers etc. So if you have the source you should be able to see what it writes before the array.
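Based on that layout (one length byte followed by the slot's character bytes), a rough sketch of reading such records could look like the following. Python is used only for illustration, the helper name is mine, and the unknown header bytes are not handled here; the same logic carries over to C#, with the code-page conversion from the earlier answer applied to the character bytes.

def read_shortstring(data, offset, capacity):
    # Read one string[capacity] record: a length byte, then `capacity` character bytes,
    # of which only the first `length` are meaningful.
    length = data[offset]
    text = data[offset + 1:offset + 1 + length].decode("cp1252")
    return text, offset + 1 + capacity          # the next record starts after the full slot

dump = bytes.fromhex("024142024344")            # the "02 41 42 02 43 44" portion of the dump above
pos = 0
while pos < len(dump):
    s, pos = read_shortstring(dump, pos, 2)     # string[2] records
    print(s)                                    # AB, then CD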

How to use binary symbols for the alphabet

I was reading an article on binary numbers and it had some practice problems at the end, but it didn't give the solutions to the problems. The last one is "How many bits are required to represent the alphabet?". Can you tell me the answer to that question and briefly explain why?
Thanks.
You would only need 5 bits because you are counting to 26 (if we take only upper or lowercase letters). 5 bits will count up to 31, so you've actually got more space than you need. You can't use 4 because that only counts to 15.
If you want both upper and lowercase then 6 bits is your answer - 6 bits will happily count to 63, while your double alphabet has (2 * 26 = 52) characters, again leaving plenty of headroom.
It depends on your definition of alphabet. If you want to represent one character from the 26-letter Roman alphabet (A-Z), then you need log2(26) = 4.7 bits. Obviously, in practice, you'll need 5 bits.
However, given an infinite stream of characters, you could theoretically come up with an encoding scheme that got close to 4.7 bits (there just won't be a one-to-one mapping between individual characters and bit vectors any more).
If you're talking about representing actual human language, then you can get away with a far lower number than this (in the region of 1.5 bits/character), due to redundancy. But that's too complicated to get into in a single post here... (Google keywords are "entropy", and "information content").
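For the raw counts, here's a quick check (Python used only for illustration):

import math
print(math.ceil(math.log2(26)))   # 5 bits for a 26-letter alphabet
print(math.ceil(math.log2(52)))   # 6 bits if upper and lower case are distinct
print(2**4, 2**5, 2**6)           # 16, 32, 64 distinct codes respectively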
There are 26 letters in the alphabet, so 2^5 = 32 means that 5 bits is the minimum word length that can contain all the letters.
How direct does the representation need to be? If you need 1:1 with no translation layer, then 5 bits will do. But if a translation layer is an option, then you can get away with less. Morse code, for example, can do it in 3 bits. :)