Shifting binary string to a canonical form - binary

I need to compare binary strings but I consider them to be the same if one is a circular shift of the other. For example, I need the following to be the same:
10011, 11100, 01110, 11001
I think it'll be easier to store them in a canonical way, and just check for equivalence. All strings to be compared has the same length (might be more than 100 bits), so I'll keep the length unchanged, and define the canonical form as the smallest binary number we can get from a circular shift. For example, 00111 is the canonical form of the binary strings shown above.
My question is, given a binary string, how can I circular shift it to get the smallest binary number without checking all possible shifts?
If other representation would be better, I'd be happy to receive suggestions.
I'll add, that flipping the string also doesn't matter, so the canonical form of 010011 is 001101 (if as described above) but if flipping is allowed, the canonical form should be 001011 (which I prefer). A possible solution is just to compute the canonization of the string and of the flipped string and choose the smaller.
If it helps, I'm working in MATLAB with binary vectors, but no need for code, an explanation of a way to solve that will be enough, thanks!

Find the longest sequence of 0s, if > 1 shift until the first is the msb.
oops edge case would be 00100, so if you have two equal longest sequences of zeros, you'd want the only possible 1 as the lsb as well.
Find the longest sequence of 1s, if > 1 shift until the last is the lsb
If there are no consecutive 1s or 0s shift until the lsb is 1
That said it's only five shifts so the brute force way you know will work could easily be less effort...

Related

can we interpret negative binary as positive too(read the question please)?

I already know the concept of negative binary numbers, The 0 at the position most significant bit represents that the binary is positive and 1 at the position of most significant bit represents that the binary is negative.
BUT THE PROBLEM THAT INTIMIDATED ME TO ASK A QUESTION ON STACKOVERFLOW IS THIS:
what about the times that we might want to represent a huge number that it's representation has occurred to have 1 in msb.
let me explain it in this way: by considering the above rule for making negative counterparts of our binary numbers we could say that ;
in an 8-bit system we have, For example, a value of positive 12 (decimal) would be written as 00001100 in binary, but negative 12 (decimal) would be written as
10001100 but what makes me confused a bit is that 10001100 could also be interpreted as 268 in decimal while we know that its the negative form of 12 in binary using this method of conversion.
I just want to know how to deal with this tricky, two-faced possible ways of interpreting a binary number, just like the example i gave above(it seem's to be negative, OH! but wait it might also not be:).
It depends on the type you use. If you're using an 8-bit representation which is signed, then the largest number you can store is 1111111 (i.e. the first bit is set aside).
In our example, that would convert to an integer value of 127.
If you use an unsigned type, then you have the extra bit available, allowing you to store 11111111, or 255 expressed as an integer.
A strongly typed language with a good compiler should catch you trying to assign, say, 134 to a signed 8 bit integer and vomit errors all over you.
If you're doing something strange fiddling around with bits yourself, you're on your own! There's no way of reconstructing, post hoc, whether it was intended to be a negative or a large positive, so you'll have to choose a system and stick with it.
The general convention nowadays is to stick with signed representations always - although I have seen code (usually for extreme compute tasks like astrophysical calculations) use unsigned values simply to save memory. And of course images will use unsigned values by convention, usually.

Turing Machine to compare multiple binary numbers

I was wondering how to construct a Turing Machine for
A<B<C<D...<N
with all numbers (A,B,C,D,...,N) being positive binary numbers.
These are a couple examples of how the machine should work:
1001 - Accepts because there is only one number
0<1 - Accepts
0010<1000<0001 - Doesn't accept because 1000!<0001
0100<1010<1010<1000 - Doesn't accept because 1010!<1010
I've tried methods that work to compare only two numbers but I can't seem to find a way to compare multiple (should work for infinite number of inputs) numbers.
Here is a high-level block diagram for solving this problem. You can implement these blocks by using block feature of JFLAP.
Blocks Description
Done? : This block decides whether all comparisons are done, if yes, it accepts, otherwise, it goes to compare the very next two numbers.
A<B : This block is responsible to compare two binary numbers, the one that cursor is pointing at the first digit and the next one. You can use '<' as the separator between A and B and the next number.
cleanup: during the comparison, you might marked 0's and 1's to something else. This block clean up everything and prepare everything for the next comparison.
Hopefully, this gives you an idea to solve the problem.

Minimum swaps to create an alternating binary string of ones and zeros

Given a binary string, that is it contains only 0s and 1s (number of zeros equals the number of ones) We need to make this string a sequence of alternate characters by swapping some of the bits, our goal is to minimize the number swaps.
For example, for the string "00011011" the minimum number of swaps is 2, one way to do it is:
1) swap the bits : 00011011 --->> 00010111
2) swap the bits(after the first swap) : 00010111 --->> 01010101
Note that if we are given the string "00101011" we can turn it into an alternate string starting with 0 (that requires 3 swaps) and also into alternate string starting with 1 ( that requires one swap - the first and the last bits ).
So the minimum in this case is one swap.
The end goal is to return the minimum number of swaps for a given string of ones and zeros.
What is the most efficient way to solve it?
Trivial stuff is trivial.
xor your sequence with 010101010…. Note: if the sequence was already "correct", you'd get an all-0 bit stream (also do with the inverse if you don't care whether you start with 0 or 1). XOR is very efficient on modern CPUs. I can do 512 bits at a time on my Xeon, and I think it takes 2 cycles.
count the ones (#ones). On modern x87, and a lot of other ISAs, there's a POPCNT instruction that makes that extremely efficient. Throughput one, 64 bits at a time on my CPU.
swaps necessary is #ones/2 (obvious), or #zeros/2, whichever is less.
Note: Code published on this site is CC-by-SA. That means that you're legally required to state the URL of this answer and my username when you use this in your software, or your homework, or your exam, or your assignment.

Huffman Coding: handling negative ambiguity with zero

I've written a simple text file compressor that uses Huffman coding. I encode the text and write the binary resulting from Huffman to a file. To decode, I read in the binary and step through the Huffman tree.
That part is straightforward. The problem arises with 0 and negative numbers. For practice/fun/learning, I decided to do my own binary conversion methods (from a Java byte to a string and vice-versa) and I decided to represent negative numbers by flipping the last bit to a 1.
E.g, -2 = 00000101;; 2 = 00000100 (the extra 0's for padding since even the unnecessary 0's are important in Huffman... it's irrelevant, though)
However, 0 = 00000000 = 00000001
This may not seem like a problem, but those two binary strings map to two different characters in the huffman tree.
Is there a better way handle negatives in binary that will get around this?
I'm not sure this will help you, but i will try:
First of all, there is different kind of binary, pure or the others. Binary pure DON'T allow negatives, it goes from 0.......
You can use magnitude and sign, another kind of binnary, it allows negative numbers, and the - or + sign is represented with the most important bit of the number, for example:
A number with 4 bits:
0100=2
1100=-2
(1 bit for the sign, the most important, the first left one, and the other 3 for the number)
You can use too the Two's complement, but it's harder and you need to get the number in binary and then translate it to the other type.
I hope i could help you, and sorry for the lot of mistakes in english!

Best practice for storing weights in a SQL database?

An application I'm working on needs to store weights of the format X pounds, y.y ounces. The database is MySQL, but I imagine this is DB agnostic.
I can think of three ways to do this:
Convert the weight to decimal pounds and store in a single field. (5 lbs 6.2 oz = 5.33671875 lbs)
Convert the weight to decimal ounces and store in a single field. (5 lbs 6.2 oz = 86.2 oz)
Store the pounds portion as an integer and the ounces portion as a decimal, in two fields.
I'm thinking that #1 is not such a good idea, since decimal pounds will produce numbers of arbitrary precision, which would need to be stored as a float, which could lead to inaccuracies which are inherent in floating point numbers.
Is there a compelling reason to choose #2 over #3 or vise-versa?
TL;DR
Choose either option #1 or option #2—there's no difference between them. Don't use option #3, because it's awkward to work with.
You claim that there are inherent inaccuracies in floating point numbers. I think that this deserves to be explored a little first.
When deciding upon a numeral system for representing a number (whether on a piece of paper, in a computer circuit, or elsewhere), there are two separate issues to consider:
its basis; and
its format.
Pick a base, any base…
Limited by finite space, one cannot represent an arbitrary member of an infinite set. For example: no matter how much paper you buy or how small your handwriting, it'd always be possible to find an integer that won't fit in the given space (you could just keep appending extra digits until the paper runs out). So, with integers, we usually restrict our finite space to representing only those that fall within some particular interval—e.g. if we have space for the positive/negative sign and three digits, we might restrict ourselves to the interval [-999,+999].
Every non-empty interval contains an infinite set of real numbers. In other words, no matter what interval one takes over the real numbers—be it [-999,+999], [0,1], [0.000001,0.000002] or anything else—there is still an infinite set of reals within that interval (one need only keep appending (non-zero) fractional digits)! Therefore arbitrary real numbers must always be "rounded" to something that can be represented in finite space.
The set of real numbers that can be represented in finite space depends upon the numeral system that is used. In our (familiar) positional base-10 system, finite space will suffice for one-half (0.510) but not for one-third (0.33333…10); by contrast, in the (less familiar) positional base-9 system, it is the other way around (those same numbers are respectively 0.44444…9 and 0.39). The consequence of all this is that some numbers that can be represented using only a small amount of space in positional base-10 (and therefore appear to be very "round" to us humans), e.g. one-tenth, would actually require infinite binary circuits to be stored precisely (and therefore don't appear to be very "round" to our digital friends)! Notably, since 2 is a factor of 10, the same is not true in reverse: any number that can be represented with finite binary can also be represented with finite decimal.
We can't do any better for continuous quantities. Ultimately such quantities must use a finite representation in some numeral system: it's arbitrary whether that system happens to be easy on computer circuits, on human fingers, on something else or on nothing at all—whichever system is used, the value must be rounded and therefore it always results in "representation error".
In other words, even if one has a perfectly accurate measuring instrument (which is physically impossible), then any measurement it reports will already have been rounded to a number that happens to fit on its display (in whatever base it uses—typically decimal, for obvious reasons). So, "86.2 oz" is never actually "86.2 oz" but rather a representation of "something between 86.1500000... oz and 86.2499999... oz". (Actually, because in reality the instrument is imperfect, all we can ever really say is that we have some degree of confidence that the actual value falls within that interval—but that is definitely departing some way from the point here).
But we can do better for discrete quantities. Such values are not "arbitrary real numbers" and therefore none of the above applies to them: they can be represented exactly in the numeral system in which they were defined—and indeed, should be (as converting to another numeral system and truncating to a finite length would result in rounding to an inexact number). Computers can (inefficiently) handle such situations by representing the number as a string: e.g. consider ASCII or BCD encoding.
Apply a format…
Since it's a property of the numeral system's (somewhat arbitrary) basis, whether or not a value appears to be "round" has no bearing on its precision. That's a really important observation, which runs counter to many people's intuition (and it's the reason I spent so much time explaining numerical basis above).
Precision is instead determined by how many significant figures a representation has. We need a storage format that is capable of recording our values to at least as many significant figures as we consider them to be correct. Taking by way of example values that we consider to be correct when stated as 86.2 and 0.0000862, the two most common options are:
Fixed point, where the number of significant figures depends on magnitude: e.g. in fixed 5-decimal-point representation, our values would be stored as 86.20000 and 0.00009 (and therefore have 7 and 1 significant figures of precision respectively). In this example, precision has been lost in the latter value (and indeed, it wouldn't take much more for us to have been totally unable to represent anything of significance); and the former value stored false precision, which is a waste of our finite space (and indeed, it wouldn't take much more for the value to become so large that it overflows the storage capacity).
A common example of when this format might be appropriate is for an accounting system: monetary sums must usually be tracked to the penny irrespective of their magnitude (therefore less precision is required for small values, and more precision is required for large values). As it happens, currency is usually also considered to be discrete (pennies are indivisible), so this is also a good example of a situation where a particular basis (decimal for most modern currencies) is desirable to avoid the representation errors discussed above.
One usually implements fixed point storage by treating one's values as quotients over a common denominator and storing the numerator as an integer. In our example, the common denominator could be 105, so instead of 86.20000 and 0.00009 one would store the integers 8620000 and 9 and remember that they must be divided by 100000.
Floating point, where the number of significant figures is constant irrespective of magnitude: e.g. in 5-significant-figure decimal representation, our values would be stored as 86.200 and 0.000086200 (and, by definition, have 5 significant figures of precision both times). In this example, both values have been stored without any loss of precision; and they both also have the same amount of false precision, which is less wasteful (and we can therefore use our finite space to represent a far greater range of values—both large and small).
A common example of when this format might be appropriate is for recording any real world measurements: the precision of measuring instruments (which all suffer from both systematic and random errors) is fairly constant irrespective of scale so, given sufficient significant figures (typically around 3 or 4 digits), absolutely no precision is lost even if a change of base resulted in rounding to a different number.
One usually implements floating point storage by treating one's values as integer significands with integer exponents. In our example, the significand could be 86200 for both values whereupon the (base-10) exponents would be -4 and -9 respectively.
But how precise are the floating point storage formats used by our computers?
An IEEE754 single precision (binary32) floating point number has 24 bits, or log10(224) (over 7) digits, of significance—i.e. it has a tolerance of less than ±0.000006%. In other words, it is more precise than saying "86.20000".
An IEEE754 double precision (binary64) floating point number has 53 bits, or log10(253) (almost 16) digits, of significance—i.e. it has a tolerance of just over ±0.00000000000001%. In other words, it is more precise than saying "86.2000000000000".
The most important thing to realise is that these formats are, respectively, over ten thousand and over one trillion times more precise than saying "86.2"—even though exact conversions of the binary back into decimal happens to include erroneous false precision (which we must ignore: more on this shortly)!
Notice also that both fixed and floating point formats will result in loss of precision when a value is known more precisely than the format supports. Such rounding errors can propagate in arithmetic operations to yield apparently erroneous results (which no doubt explains your reference to the "inherent inaccuracies" of floating point numbers): for example, 1⁄3 × 3000 in 5-place fixed point would yield 999.99000 rather than 1000.00000; and 1⁄7 − 7⁄50 in 5-significant figure floating point would yield 0.0028600 rather than 0.0028571.
The field of numerical analysis is dedicated to understanding these effects, but it is important to realise that any usable system (even performing calculations in your head) is vulnerable to such problems because no method of calculation that is guaranteed to terminate can ever offer infinite precision: consider, for example, how to calculate the area of a circle—there will necessarily be loss of precision in the value used for π, which will propagate into the result.
Conclusion
Real world measurements should use binary floating point: it's fast, compact, extremely precise and no worse than anything else (including the decimal version from which you started). Since MySQL's floating-point datatypes are IEEE754, this is exactly what they offer.
Currency applications should use denary fixed point: whilst it's slow and wastes memory, it ensures both that values are not rounded to inexact quantities and that pennies are not lost on large monetary sums. Since MySQL's fixed-point datatypes are BCD-encoded strings, this is exactly what they offer.
Finally, bear in mind that programming languages usually represent fractional values using binary floating-point types: so if your database stores values in another format, you need to be careful how they are brought into your application or else they may get converted (with all the ensuing issues that entails) at the interface.
Which option is best in this case?
Hopefully I've convinced you that your values can safely (and should) be stored in floating point types without worrying too much about any "inaccuracies"? Remember, they're more precise than your flimsy 3-significant-digit decimal representation ever was: you just have to ignore false precision (but one must always do that anyway, even if using a fixed-point decimal format).
As for your question: choose either option 1 or 2 over option 3—it makes comparisons easier (for example, to find the maximal mass, one could just use MAX(mass), whereas to do it efficiently across two columns would require some nesting).
Between those two, it doesn’t matter which one chooses—floating point numbers are stored with a constant number of significant bits irrespective of their scale.
Furthermore, whilst in the general case it could happen that some values are rounded to binary numbers that are closer to their original decimal representation using option 1 whilst simultaneously others are rounded to binary numbers that are closer to their original decimal representation using option 2, as we shall shortly see such representation errors only manifest within the false precision that should always be ignored.
However, in this case, because it happens that there are 16 ounces to 1 pound (and 16 is a power of 2), the relative differences between original decimal values and stored binary numbers using the two approaches is identical:
5.387510 (not 5.3367187510 as stated in your question) would be stored in a binary32 float as 101.0110001100110011001102 (which is 5.3874998092651367187510): this is 0.0000036% from the original value (but, as discussed above, the "original value" was already a pretty lousy representation of the physical quantity it represents).
Knowing that a binary32 float stores only 7 decimal digits of precision, our compiler knows for certain that everything from the 8th digit onwards is definitely false precision and therefore must be ignored in every case—thus, provided that our input value didn't require more precision than that (and if it did, binary32 was obviously the wrong choice of format), this guarantees a return to a decimal value that looks just as round as that from which we started: 5.38750010. However, we should really apply domain knowledge at this point (as we should with any storage format) to discard any further false precision that might exist, such as those two trailing zeroes.
86.210 would be stored in a binary32 float as 1010110.001100110011001102 (which is 86.199996948242187510): this is also 0.0000036% from the original value. As before, we then ignore false precision to return to our original input.
Notice how the binary representations of the numbers are identical, except for the placement of the radix point (which is four bits apart):
101.0110 00110011001100110
101 0110.00110011001100110
This is because 5.3875 × 24 = 86.2.
As an aside: being European (albeit British), I also have a strong aversion to imperial units of measurement—handling values of different scales is just so messy. I'd almost certainly store masses in SI units (e.g. kilograms or grams) and then perform conversions to imperial units as required within the presentation layer of my application. Plus rigidly adhering to SI units might one day save you from losing $125m.
I’d be tempted to store it in a metric unit, as they tend to be simple decimals and not complex values like pounds and ounces. That way, you can just store the one value (i.e. 103.25 kg) rather than the pounds–ounces equivalent, and it’s easier to perform conversions.
This is something I’ve dealt with in the past. I do a lot of work on pro wrestling and mixed martial arts (MMA) websites where fighters’ heights and weights need to be recorded. They tend to be displayed as feet and inches and pounds and ounces, but I still store the values in their centimetres and kilogram equivalents, and then do the conversion when displaying on the site.
First, I had not known about how floating point numbers were inaccurate - thankfully a search latter helps me understand: Floating Point Inaccuracy Examples
I would fully agree with #eggyal - keep the data in a single format in a single column. This allows you to expose it to the application and let the application deal with the presentation of it - be it in lbs/oz, rounded up lbs, whatever.
The database should keep the raw data while the presentation layer dictates the layout.
You can use decimal data type for weight column.
decimal('weight', 8, 2); // precision = 8, scale = 2
Storage size:
Precision 1-9 5 Bytes
Precision 10-19 9 Bytes
Precision 20-28 13 Bytes
Precision 29-38 17 Bytes