String Compare "Logic" - language-agnostic

Can anybody please tell me why the string comparisons below deliver these results?
>>"1040"<="12000"
True
>> "1040"<="10000"
False
I've tried the string comparison in both C and Python, the result is obviously correct, I just can't figure out how the result is calculated...
P.S.: I know that comparing strings of different length is something you shouldn't do, but I'm still wondering about the logic behind the above lines ;-)

"1" is equal to "1".
"0" comes before "2" (so "1040" < "12000").
"4" comes after "0" (so "1040" > "10000").

The fancy word here describing this ordering is "lexicographical order" (and sometimes "dictionary order"). In everyday language we just refer to it as "alphabetical order". What this means is that we place first an ordering on our alphabet (A, B, ... Z, etc.) and then to compare two words over this alphabet we compare one character at a time until we find two non-equal characters in the same position and return the comparison between these two characters.
Example: The "natural" ordering on the alphabet { A, B, C, ..., Z } is that A < B < C < ... < Z. Given two words s = s_1s_2...s_m and t = t_1t_2...t_n we compare s_1 to t_1. If s_1 < t_1 we say that s < t. If s_1 > t_1 we say that s > t. If s_1 = t_1 we recurse on the words s_2...s_m and t_2...t_n. For this to work we say that the empty string is less than all non-empty strings.
In the old days, before Unicode and the like, the ordering on our symbols was just the ordering of the ASCII character codes. So then we have 0 < 1 < 2 < ... < 9 < ... < A < B < C < ... < Z < ... < a < b < c < ... < z. It's more complicated in the days of Unicode, but the same principle applies.
Now, what all this means is that if we want to compare 1040 and 12000 we would use the following:
1040 compare to 12000 is equal to 040 compare to 2000 which gives 040 < 2000 because 0 < 2 so that, finally, 1040 < 12000.
1040 compare to 10000 is equal to 040 compare to 0000 is equal to 40 compare to 000 which gives 40 > 000 because 4 > 0 so that, finally, 1040 > 10000.
The key here is that these are strings and do not have a numerical meaning; they are merely symbols and we have a certain ordering on our symbols. That is, we could achieve exactly the same answer if we replaced 0 by A, 1 by B, ..., and 9 by J and said that A < B < C < ... < J. (In this case we would be comparing BAEA to BAAAA and BAEA to BCAAA. )
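The character-by-character rule described above can be sketched in a few lines of Python. The function name lex_compare is made up for illustration; Python's built-in string operators already behave this way, comparing by code point.

```python
def lex_compare(s: str, t: str) -> int:
    """Return -1, 0 or 1 as s sorts before, equal to, or after t."""
    # Walk both strings in step and stop at the first position that differs.
    for a, b in zip(s, t):
        if a != b:
            return -1 if a < b else 1
    # No difference found: one string is a prefix of the other (or they are
    # equal), so the shorter one sorts first.
    return (len(s) > len(t)) - (len(s) < len(t))


print(lex_compare("1040", "12000"))  # -1, decided by '0' < '2'
print(lex_compare("1040", "10000"))  # 1, decided by '4' > '0'
print(lex_compare("BAEA", "BCAAA"))  # -1, same comparison with letters
```

The last call is the digit-to-letter substitution from the paragraph above: the outcome is the same because only the ordering of the symbols matters.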

Think alphabetized.

The strings are compared, one character at a time, from left to right:
10000
1040
12000
There's nothing wrong with comparing strings of different lengths.

You're experiencing lexicographical ordering.
There are some generalized algorithms for this ordering in the book Elements of Programming. Search for the word lexicographical.

It compares the "numbers" on a character by character basis. In the first case, "1" == "1", but then "0" < "2" in ASCII (and as an integer) so it returns true.
In the second case, 1==1, 0==0, but 4 > 0, so it returns false.
And there's nothing wrong with comparing strings of a different length... but you should use the appropriate string comparison method.

In C, string comparisons are done character by character. In the first case, the first characters of the strings are equal, so it comes down to the second character: '0' is < '2', so "1040" < "12000". In the second case, the first two characters of the strings are equal, so the third character is the basis -- '4' > '0', so "1040" > "10000".
If you want them compared as numbers, you'll need to convert them to numbers first, then do the comparison.
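As a small illustration of the difference, here is the second comparison from the question done both ways in Python (the variable names are only for this sketch):

```python
a, b = "1040", "10000"

# Compared as strings: decided at the third character, '4' > '0'.
as_strings = a <= b

# Compared as numbers: convert first, then compare.
as_numbers = int(a) <= int(b)

print(as_strings)  # False
print(as_numbers)  # True
```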

To expand on John P's answer, think of the strings as words, and read them left-to-right.
To look at it another way,
BAEA would come before BCAAA but after BAAAA

It compares each character since you are comparing strings. If you wish to compare the numbers, then make them a numerical type.

"10000" <= "1040" <= "12000" in the same way that "fabricate" <= "fact" <= "foolish".

How about making them the same length?
That would unify numbers and alphas.
1040 becomes 01040.
01040 < 12000, and now it makes sense.
Maybe that is why he felt it was wrong to compare strings of different length:
when the strings are numbers, they should be the same length.
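A rough sketch of the padding idea in Python, assuming the strings hold non-negative integers without signs or decimal points. str.zfill pads on the left with zeros; the sample values are the three from the question.

```python
values = ["1040", "12000", "10000"]

# Plain string sort: "10000" comes before "1040".
plain = sorted(values)

# Pad every value to the width of the longest one, then sort as strings.
width = max(len(v) for v in values)
padded = sorted(v.zfill(width) for v in values)

print(plain)   # ['10000', '1040', '12000']
print(padded)  # ['01040', '10000', '12000']
```

Once all strings have the same length, lexicographical order and numeric order agree.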

Related

In SQL - how can I count the number of times Bit(0), Bit(1), ... Bit(N) are high for a decimal number?

I am dealing with a table of decimal values that represent binary numbers. My goal is to count the number of times Bit(0), Bit(1),... Bit(n) are high.
For example, if a table entry is 5 this converts to '101' which can be done using the BIN() function.
What I would like to do is increment a variable 'bit0Count' and 'bit2Count'
I have looked into the BIT_COUNT() function however this would only return 2 for the above example.
Any insight would be greatly appreciated.
SELECT SUM(n & (1<<2) > 0) AS bit2Count FROM ...
The & operator is a bitwise AND.
1<<2 is a number with only 1 bit set, left-shifted by two places, so it is binary 100. The bitwise AND of this with your column n is therefore either binary 100 or binary 000.
Testing that with > 0 returns either 1 or 0, since in MySQL, boolean results are literally the integers 1 for true and 0 for false (note this is not standard in other implementations of SQL).
Then you can SUM() these 1's and 0's to get a count of the occurrences where the bit was set.
To tell if bit N is set, use 1 << N to create a mask for that bit and then use bitwise AND to test it. So (column & (1 << N)) != 0 will be 1 if bit N is set, 0 if it's not set.
To total these across rows, use the SUM() aggregation function.
If you need to do this frequently, you could define a stored function:
CREATE FUNCTION bit_set(val INT UNSIGNED, which TINYINT) RETURNS BOOLEAN DETERMINISTIC
RETURN (val & (1 << which)) != 0;
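To see the mask-and-sum logic outside the database, here is a minimal Python sketch. The list values stands in for the table column and the function name count_bit_set is invented for this example.

```python
def count_bit_set(values, n):
    """Count how many integers in values have bit n set (bit 0 is the lowest)."""
    mask = 1 << n
    # (v & mask) != 0 is True or False; summing treats them as 1 and 0,
    # which mirrors SUM(n & (1 << N) > 0) in MySQL.
    return sum((v & mask) != 0 for v in values)


values = [5, 3, 4, 1]  # binary 101, 011, 100, 001

print(count_bit_set(values, 0))  # 3
print(count_bit_set(values, 1))  # 1
print(count_bit_set(values, 2))  # 2
```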

How do I use the BETWEEN operator for text searches in a MySQL database?

I have a SQL table on which I use the BETWEEN operator.
The BETWEEN operator selects values within a range. The values can be numbers, text, or dates.
stu_id  name           city     pin
1       Raj            Ranchi   123456
2       sonu           Delhi    652345
3       ANU            KOLKATA  879845
4       K.K's Company  Delhi    345546
5       J.K's Company  Delhi    123456
I have queries like this:
SELECT * FROM student WHERE stu_id BETWEEN 2 AND 4; -- includes 2 and 4
SELECT * FROM `student` WHERE name BETWEEN 'A' AND 'K'; -- includes A but not K
My question is: why is K not included?
I want K to be included in the search results as well.
Don't use between -- until you really understand it. That is just general advice. BETWEEN is inclusive, so your second query is equivalent to:
WHERE name >= 'A' AND
name <= 'K'
Because of the equality, 'K' is included in the result set. However, names longer than one character and starting with 'K' are not -- "Ka" for instance.
Instead, be explicit:
WHERE name >= 'A' AND
name < 'L'
Of course, BETWEEN can be useful. However, it is useful for discrete values, such as integers. It is a bit dangerous with numbers with decimals, strings, and date/time values. That is why I encourage you to express the logic as inequalities.
To supplement Gordon's answer, one way to get what you're expecting is to turn your name into a discrete set of values:
SELECT * FROM `student` WHERE LEFT(name, 1) between 'A' and 'K'
You need to appreciate that K.K's Company is alphabetically AFTER the letter K on its own so it is not BETWEEN, in the same way that 4.1 is not BETWEEN 2 and 4
Stripping it down to just the first character of the string will work as you expect, but take note: you should generally avoid running functions on column values in a WHERE clause. If you had a million names, that's a million strings that MySQL has to cut down to their first letter, and it might no longer be able to use an index on name, which batters performance.
Instead, you could :
SELECT * FROM `student` WHERE name >= 'A' and name < 'L'
which is more likely to permit the use of an index as you aren't manipulating the stored values before comparing them
This works because it asks for everything up to but not including L, which includes all of your names starting with K, even Kzzzzzzzz. Numerically it is equivalent to saying number >= 2 and number < 5, which gives you all the numbers starting with 2, 3 or 4 (like the 4.1 from before) but not 5.
Remember that BETWEEN is inclusive at both ends. Always revert to a pattern of a >= b and a < c, a >= c and a < d when you want to specify ranges that capture all possible values
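If you want to try the two range styles side by side, the sketch below uses Python's built-in sqlite3 module with an in-memory table as a stand-in for MySQL. That is an assumption of the example: SQLite compares text byte by byte and case-sensitively by default, while MySQL's default collations are usually case-insensitive, though for these five rows that should not change the outcome. The helper names is invented here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (stu_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?)",
    [(1, "Raj"), (2, "sonu"), (3, "ANU"),
     (4, "K.K's Company"), (5, "J.K's Company")],
)


def names(where):
    """Run a SELECT with the given WHERE clause and return the names."""
    rows = conn.execute(
        f"SELECT name FROM student WHERE {where} ORDER BY stu_id"
    )
    return [row[0] for row in rows]


# Inclusive at both ends, yet "K.K's Company" sorts after the bare 'K'.
inclusive = names("name BETWEEN 'A' AND 'K'")

# Half-open range: everything from 'A' up to, but not including, 'L'.
half_open = names("name >= 'A' AND name < 'L'")

print(inclusive)  # ['ANU', "J.K's Company"]
print(half_open)  # ['ANU', "K.K's Company", "J.K's Company"]
```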
Compare in lexicographical order, 'K.K's Company' > 'K'
We should convert the string to integer. You can try that mysql script with CAST and SUBSTRING. I've updated your script here. It will include the last record as well.
SELECT * FROM student WHERE name CAST(SUBSTRING(username FROM 1) AS UNSIGNED)
BETWEEN 'A' AND 'K';
The script will work. Hope it helps.

Best datatype to store a long number made of 0 and 1

I want to know what's the best datatype to store these:
null
0
/* the length of other numbers is always 7 digits */
0000000
0000001
0000010
0000011
/* and so on */
1111111
I have tested, INT works as well. But there is a better datatype. Because all my numbers are made of 0 or 1 digits. Is there any better datatype?
What you are showing are binary numbers
0000000 = 0
0000001 = 2^0 = 1
0000010 = 2^1 = 2
0000011 = 2^0 + 2^1 = 3
So simply store these numbers in an integer data type (which is internally stored with bits as shown of course). You could use BIGINT for this, as recommended in the docs for bitwise operations (http://dev.mysql.com/doc/refman/5.7/en/bit-functions.html).
Here is how to set flag n:
UPDATE mytable
SET bitmask = POW(2, n-1)
WHERE id = 12345;
Here is how to add a flag:
UPDATE mytable
SET bitmask = bitmask | POW(2, n-1)
WHERE id = 12345;
Here is how to check a flag:
SELECT *
FROM mytable
WHERE bitmask & POW(2, n-1)
But as mentioned in the comments: In a relational database you usually use columns and tables to show attributes and relations rather than an encoded flag list.
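The three operations above (set, add, check) look like this on a plain Python integer. This is only a sketch: the helper flag is made up, flags are numbered from 1 as in the SQL above, and it uses 1 << (n - 1), which has the same value as POW(2, n-1) but stays an integer.

```python
def flag(n):
    """Return the mask for flag n, counting from 1 (same value as 2 ** (n - 1))."""
    return 1 << (n - 1)


# Set flag 3 only; any previous flags are overwritten.
bitmask = flag(3)

# Add flag 1 while keeping the flags that are already set.
bitmask |= flag(1)

# Check individual flags.
has_flag_1 = bool(bitmask & flag(1))
has_flag_2 = bool(bitmask & flag(2))

print(format(bitmask, "07b"))  # 0000101
print(has_flag_1, has_flag_2)  # True False
```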
As you've said in a comment, the values 01 and 1 should not be treated as equivalent (which rules out binary where they would be), so you could just store as a string.
It actually might be more efficient than storing as a byte + offset since that would take up 9 characters, whereas you need a maximum of 7 characters
Simply store as a varchar(7) or whatever the equivalent is in MySQL. No need to be clever about it, especially since you are interested in extracting positional values.
Don't forget to bear in mind that this takes up a lot more storage than storing as a bit(7), since you are essentially storing 7 bytes (or whatever the storage unit is for each level of precision in a varchar), not 7 bits.
If that's not an issue then no need to over-engineer it.
You could convert the binary number to a string, with an additional byte to specify the number of leading zeros.
Example - the representation of 010:
The numeric value in hex is 0x02.
There is one leading zero, so the first byte is 0x01.
The result string is 0x01,0x02.
With the same method, 1010010 should be represented as 0x00,0x52.
Seems to me pretty efficient.
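Here is one way the two-byte scheme could look in Python, assuming strings of at most 7 binary digits so that both the leading-zero count and the value fit in one byte each. The function names encode and decode are invented for this sketch.

```python
def encode(bits: str) -> bytes:
    """Pack a string of 0/1 digits as (leading-zero count, numeric value)."""
    stripped = bits.lstrip("0")
    leading = len(bits) - len(stripped)
    value = int(stripped, 2) if stripped else 0
    return bytes([leading, value])


def decode(packed: bytes) -> str:
    """Rebuild the original digit string from the two bytes."""
    leading, value = packed
    # A value of 0 contributes no digits: the zeros are all in the count.
    digits = format(value, "b") if value else ""
    return "0" * leading + digits


print(encode("010").hex())      # 0102
print(encode("1010010").hex())  # 0052
print(decode(encode("01")), decode(encode("1")))  # 01 1
```

With this layout "01" and "1" stay distinct, which is the property a plain integer column loses.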
Not sure if it is the best datatype, but you may want to try BIT:
MySQL, PostgreSQL
There are also some useful bit functions in MySQL.

MySQL: compare a mixed field containing letters and numbers

I have a field in the mysql database that contains data like the following:
Q16
Q32
L16
Q4
L32
L64
Q64
Q8
L1
L4
Q1
And so forth. What I'm trying to do is pull out, let's say, all the values that start with Q which is easy:
field_name LIKE 'Q%'
But then I want to filter, let's say, all the values that have a number higher than 32. As a result I should get only 'Q64'; however, I also get Q4, Q8 and so forth, because I'm comparing them as strings: only the 3 and the corresponding digit are compared, so the numbers are effectively treated digit by digit, not as integers.
While this makes perfect sense, I'm struggling to find a solution for performing this operation without pulling all the data out of the database, stripping out the Qs and parsing it all to integers.
I did play around with the CAST operator; however, it only works if the value is stored as a string AND it contains only digits. The parsing fails if there's another character in there.
Extract the number from the string and cast it to a number with *1 or cast
select * from your_table
where substring(field_name, 1, 1) = 'Q'
and substring(field_name, 2) * 1 > 32
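The same split (first character as the prefix, the rest as a number) can be checked quickly in Python with the sample values from the question. This only illustrates the logic; it is not a replacement for doing the filtering in SQL.

```python
fields = ["Q16", "Q32", "L16", "Q4", "L32", "L64", "Q64", "Q8", "L1", "L4", "Q1"]

# Comparing whole strings reproduces the problem from the question.
as_strings = [f for f in fields if f.startswith("Q") and f > "Q32"]

# Splitting off the prefix and converting the rest to an integer fixes it.
as_numbers = [f for f in fields if f[0] == "Q" and int(f[1:]) > 32]

print(as_strings)  # ['Q4', 'Q64', 'Q8']
print(as_numbers)  # ['Q64']
```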

SQL BETWEEN for text vs numeric values

BETWEEN is used in a WHERE clause to select a range of data between two values.
If I am correct, whether the range's endpoints are excluded or not is DBMS specific.
What I cannot understand is the following:
If I have a table of values and I do the following query:
SELECT food_name
FROM health_foods
WHERE calories BETWEEN 33 AND 135;
The query returns as results rows including calories =33 and calories =135 (i.e. range endpoints are included).
But if I do:
SELECT food_name
FROM health_foods
WHERE food_name BETWEEN 'G' AND 'O';
I do not get rows with food_name starting with O. I.e. the end of the range is excluded.
For the query to work as expected I type:
SELECT food_name
FROM health_foods
WHERE food_name BETWEEN 'G' AND 'P';
My question is why is there such a difference for BETWEEN for numbers and text data?
Between is operating exactly the same way for numbers and for character strings. The two endpoints are included. This is part of the ANSI standard, so it is how all SQL dialects work.
The expression:
where num between 33 and 135
will match when num is 135. It will not match when number is 135.00001.
Similarly, the expression:
where food_name BETWEEN 'G' AND 'O'
will match 'O', but not any other string beginning with 'O'.
One simple kludge is to use "~". This has the largest printable 7-bit ASCII value, so for English-language applications, it usually works well:
where food_name between 'G' and 'O~'
You can also do various other things. Here are two ideas:
where left(food_name, 1) between 'G' and 'O'
where food_name >= 'G' and food_name < 'P'
The important point, though, is that between works the same way regardless of data type.
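The three range variants can be compared in plain Python, whose string operators also compare character by character. The food names are made up for the example, and Python compares by code point, so a case-insensitive database collation could behave differently for lowercase data.

```python
foods = ["Apple", "Grape", "O", "Olive", "Orange", "Pear"]

# Inclusive range, like BETWEEN 'G' AND 'O': only the bare "O" matches.
between = [f for f in foods if "G" <= f <= "O"]

# The "~" kludge, like BETWEEN 'G' AND 'O~'.
tilde = [f for f in foods if "G" <= f <= "O~"]

# Half-open range, like food_name >= 'G' AND food_name < 'P'.
half_open = [f for f in foods if "G" <= f < "P"]

print(between)    # ['Grape', 'O']
print(tilde)      # ['Grape', 'O', 'Olive', 'Orange']
print(half_open)  # ['Grape', 'O', 'Olive', 'Orange']
```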
Take the example of 'Orange' vs. 'O'. The string 'Orange' is clearly not equal to the string 'O', and since 'O' is a prefix of it, the longer string must be greater, not less.
You could do 'Orange' < 'OZZZZZZZZZZZZZZ' though.
Try this with REGEXP:
WHERE food_name REGEXP '^[G-O]';
This gives you every food_name that starts with a letter from G through O.
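The same pattern can be tried with Python's re module. Treat this as an approximation: Python's matching is case-sensitive unless told otherwise, whereas MySQL's REGEXP may follow the column's collation, and the food names are invented for the example.

```python
import re

# Anchor at the start and allow any single capital letter from G to O.
starts_g_to_o = re.compile(r"^[G-O]")

foods = ["Apple", "Grape", "Olive", "Orange", "Pear", "oat"]
matched = [f for f in foods if starts_g_to_o.search(f)]

print(matched)  # ['Grape', 'Olive', 'Orange']
```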