MySQL: parse and cast strings which contain numbers with units

I have a table that has a column holding string values that combine numbers and units. Each value has a numeric prefix made up of digits and at most one decimal point.
Some examples of these values would be following:
"16 GB", "8.5gb", "15.99345 GHz", "25L"
Is there a way I can use the CAST function to first parse out the portion of the string values that contains the number (digits and decimal point), and do the cast only on that portion?
This is what I had in mind
select * from my_table
where cast( numparse( my_column ) as signed ) > 10
Thanks in advance, I'm fairly new to SQL so any help would be appreciated.

Yes, you could write a stored procedure that does some sort of string parsing, or use a regex as in @ladd2025's answer...
But then you'd be redoing this conversion on every query. There's the cost of the conversion itself, but it also means you cannot take advantage of indexing. A query like where parse_the_thing( thing ) > 10 has to do a full table scan, whereas if thing were an indexed number to begin with, where thing > 10 is a very fast indexed query. This is a problem with storing "formatted" information: you have to strip the formatting every time you want to do something with it.
You'd be far better off normalizing your stored data to store the magnitude as a numeric data type such as bigint, double, or numeric, and the unit as an enum column. Or consider whether it makes sense to store all these different units in the same table; does it make sense to compare 8.5 GB with 15.99 GHz?
8.5gb stored in bytes would become the bigint 8,500,000,000 (or 9,126,805,504, depending on whether a gigabyte means 10^9 or 2^30 bytes) with the unit bytes. 15.99345 GHz might become the bigint 15,993,450,000 with the unit Hz. And so on.
You can accomplish this by adding the new columns to your table and running an update that converts the strings into a quantity and a unit, then changing whatever inputs the values to do the same. You can continue to store the original human-formatted string if you like, but you might be better off dropping it and applying the formatting as needed.
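Here is a minimal migration sketch, assuming MySQL 8.0+ for the regexp functions; the new column names and the unit handling are illustrative:
ALTER TABLE my_table
  ADD COLUMN magnitude DECIMAL(20,5) NULL,
  ADD COLUMN unit VARCHAR(8) NULL;

-- extract the numeric prefix, and whatever follows it as the unit
UPDATE my_table
SET magnitude = REGEXP_SUBSTR(my_column, '^[0-9]+(\\.[0-9]+)?') + 0,
    unit      = UPPER(TRIM(REGEXP_REPLACE(my_column, '^[0-9. ]+', '')));

-- queries can now use an ordinary index on magnitude
SELECT * FROM my_table WHERE magnitude > 10;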
This makes your queries much simpler, with less chance of bugs. And they can take advantage of indexing, so they'll be much, much faster.

You could use REGEXP_REPLACE (available since MySQL 8.0) to keep only the digits and the decimal point:
SELECT *
FROM tab
WHERE CAST(REGEXP_REPLACE(my_column, '[^0-9.]', '') AS signed) > 10;
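For example, on one of the sample values (a quick check):
SELECT REGEXP_REPLACE('15.99345 GHz', '[^0-9.]', '');
-- returns '15.99345'; the surrounding CAST(... AS signed) then truncates it to 15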

Just use the CAST() function. If you're casting to a numeric type, it will just parse the leading numeric prefix and ignore the rest (with a truncation warning).
mysql> select cast('12.45gb' as signed);
+---------------------------+
| cast('12.45gb' as signed) |
+---------------------------+
|                        12 |
+---------------------------+
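If you need the fractional part, cast to a DECIMAL instead; MySQL still warns about the trailing characters but returns the full numeric prefix:
mysql> select cast('12.45gb' as decimal(10,2));
+----------------------------------+
| cast('12.45gb' as decimal(10,2)) |
+----------------------------------+
|                            12.45 |
+----------------------------------+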

Related

When to use float vs decimal

I'm building this API, and the database will store values that represent one of the following:
percentage
average
rate
I honestly have no idea how to represent something whose range is between 0 and 100% in numbers. Should it be
0.00 - 1.00
0.00 - 100.00
any other alternative that I don't know of?
Is there a clear choice for that? A universal way of representing something that goes from 0 to 100% in databases? Going further, what's the correct data type for it, float or decimal?
Thank you.
I'll take the opposite stance.
FLOAT is for approximate numbers, such as percentages, averages, etc. You should do formatting as you display the values, either in app code or using the FORMAT() function of MySQL.
Don't ever test float_value = 1.3; there are many reasons why that will fail.
DECIMAL should be used for monetary values. DECIMAL avoids a second rounding when a value needs to be rounded to dollars/cents/euros/etc. Accountants don't like fractions of cents.
MySQL's implementation of DECIMAL allows 65 significant digits; FLOAT gives about 7 and DOUBLE about 16. 7 is usually more than enough for sensors and scientific computations.
As for "percentage" -- Sometimes I have used TINYINT UNSIGNED when I want to consume only 1 byte of storage and don't need much precision; sometimes I have used FLOAT (4 bytes). There is no datatype tuned specifically for percentage. (Note also, that DECIMAL(2,0) cannot hold the value 100, so technically you would need DECIMAL(3,0).)
Or sometimes I have used a FLOAT that held a value between 0 and 1. But then I would need to make sure to multiply by 100 before displaying the "percentage".
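For example, a sketch assuming a FLOAT column rate holding values between 0 and 1 (the table and column names are illustrative):
SELECT CONCAT(FORMAT(rate * 100, 1), '%') AS pct
FROM my_stats;
-- a stored 0.34666 displays as '34.7%'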
More
All three of "percentage, average, rate" smell like floats, so that would be my first choice.
One criterion for deciding on datatype... How many copies of the value will exist?
If you have a billion-row table with a column for a percentage, consider that TINYINT would take 1 byte (1GB total), but FLOAT would take 4 bytes (4GB total). OTOH, most applications do not have that many rows, so this may not be relevant.
As a 'general' rule, "exact" values should use some form of INT or DECIMAL. Inexact things (scientific calculations, square roots, division, etc) should use FLOAT (or DOUBLE).
Furthermore, the formatting of the output should usually be left to the application front end. That is, even though an "average" may compute to "14.6666666...", the display should show something like "14.7"; this is friendlier to humans. Meanwhile, you have the underlying value to later decide that "15" or "14.667" is preferable output formatting.
The range "0.00 - 100.00" could be done either with FLOAT and use output formatting or with DECIMAL(5,2) (3 bytes) with the pre-determination that you will always want the indicated precision.
I would generally recommend against using float. Floating-point numbers represent numbers in base 2, and some exact decimal values cannot be stored exactly in base 2, so they get rounded in operations or comparisons. This may lead to surprising behaviors.
Consider the following example:
create table t (num float);
insert into t values(1.3);

select * from t;
+------+
| num  |
+------+
|  1.3 |
+------+

select * from t where num = 1.3;
Empty set

The base-2 comparison with the number 1.3 fails. This is tricky.
In comparison, decimal provides an accurate representation of finite numbers within its range. If you change float to decimal(2,1) in the above example, you do get the expected results.
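For completeness, the same test with decimal behaves as expected:
create table t2 (num decimal(2,1));
insert into t2 values(1.3);

select * from t2 where num = 1.3;
+------+
| num  |
+------+
|  1.3 |
+------+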
I recommend using decimal(5,2) if you're going to store the value the same way you'll display it, since decimal preserves exact precision. (See https://dev.mysql.com/doc/refman/8.0/en/fixed-point-types.html)
Because floating-point values are approximate and not stored as exact values, attempts to treat them as exact in comparisons may lead to problems. They are also subject to platform or implementation dependencies.
(https://dev.mysql.com/doc/refman/8.0/en/floating-point-types.html)
A floating-point value as written in an SQL statement may not be the same as the value represented internally.
For DECIMAL columns, MySQL performs operations with a precision of 65 decimal digits, which should solve most common inaccuracy problems.
https://dev.mysql.com/doc/refman/8.0/en/problems-with-float.html
Decimal:
For financial applications it is better to use DECIMAL types, because they give you a high level of accuracy and make it easy to avoid rounding errors.
Double:
DOUBLE is probably the most commonly used data type for real values, except when handling money.
Float:
FLOAT is used mostly in graphics libraries, because of their very high demands for processing power; it is also used in situations that can tolerate rounding errors.
Reference: http://net-informations.com/q/faq/float.html
The difference between float and decimal is precision. Decimal can 100% accurately represent any number within the precision of the decimal format, whereas float cannot accurately represent all numbers.
Use decimal for, e.g., financial values, and float for, e.g., graphics-related values.
mysql> create table numbers (a decimal(10,2), b float);
mysql> insert into numbers values (100, 100);
mysql> select @a := (a/3), @b := (b/3), @a * 3, @b * 3 from numbers \G
*************************** 1. row ***************************
@a := (a/3): 33.333333333
@b := (b/3): 33.333333333333
     @a * 3: 99.999999999000000000000000000000
     @b * 3: 100
The decimal did exactly what it's supposed to do in this case: it truncated the rest, thus losing the 1/3 part.
So for sums the decimal is better, but for divisions the float is better, up to some point, of course. I mean, using DECIMAL will not give you "fail-proof arithmetic" by any means.
I hope this will help.
In T-SQL:
FLOAT stores 0.0 as 0 and doesn't require you to define the digits after the decimal point; e.g., you don't need to write FLOAT(4,2).
DECIMAL stores 0.0 as 0.0 and has the option to define a precision, like DECIMAL(4,2). I would suggest 0.00-1.00; that way you can calculate with the percentage without multiplying by 100, and when reporting you can set the data type of that column to percent so that MS Excel and other platforms display 0.5 as 50%.

SQL string literal hexadecimal key to binary and back

After extensive searching I am resorting to Stack Overflow's wisdom to help me.
Problem:
I have a database table that should effectively store values of the format (UserKey, data0, data1, ..) where the UserKey is to be handled as primary key but at least as an index. The UserKey itself (externally defined) is a string of 32 characters representing a checksum, which happens to be (a very big) hexadecimal number, i.e. it looks like this UserKey = "000000003abc4f6e000000003abc4f6e".
Now I can certainly store this UserKey in a CHAR(32) field, but I feel this is mighty inefficient, as I store a series of in-principle arbitrary characters, i.e. reserving space for more information per character than the 4 bits I need to store the hexadecimal characters (0-9, A-F).
So my thought was to convert this string literal into the hex number it really represents, and store that. But this number (32 * 4 bits = 16 bytes) is much too big to store or handle, as SQL only handles BIGINTs of 8 bytes.
My second thought was to convert this into a BINARY(16) representation, which should be compact and efficient concerning memory. However, I do not know how to efficiently convert between these two formats, as SQL also internally only handles numbers up to the maximum of 8 Bytes.
Maybe there is a way to convert this string to binary block by block and stitch the binary together somehow, in the way of:
UserKey == concat( stringblock1, stringblock2, ..)
UserKey_binary = concat( toBinary( stringblock1 ), toBinary( stringblock2 ), ..)
So my question is: is there any such mechanism foreseen in SQL that would solve this for me? What would a custom solution look like? (I find it hard to believe that I should be the first to encounter such a problem, as it has become quite common to use ridiculously long hash keys in many applications.)
Also, the UserKey_binary should then act as the relational key for the table, so I hope for a bit of speed from this more compact representation, as it needs to determine differences on a minimal number of bits. Additionally, I want to mention that I would like to do any conversion, if possible, on the server side, so that user scripts don't have to be altered (the user side should, if possible, still transmit a string literal, not [partially] converted values, in the insert statement).
In contradiction to my previous statement, it turns out that MySQL's UNHEX() function converts a string block by block and concatenates the results, much as I sketched above, so the method also works for hex literal values bigger than BIGINT's 8-byte limitation. Here is an example table that illustrates this:
CREATE TABLE `testdb`.`tab` (
`hexcol_binary` BINARY(16) GENERATED ALWAYS AS (UNHEX(charcol)) STORED,
`charcol` CHAR(32) NOT NULL,
PRIMARY KEY (`hexcol_binary`));
The primary key is a generated column, so that updates to charcol are the designated way of interacting with the table with string literals from the outside:
REPLACE into tab (charcol) VALUES ('1010202030304040A0A0B0B0C0C0D0D0');
SELECT HEX(hexcol_binary) as HEXstring, tab.* FROM tab;
As seen, building keys and indexes on hexcol_binary works as intended.
To verify the speedup, add a competing index on the character column and compare the query plans:
ALTER TABLE `testdb`.`tab`
ADD INDEX `charkey` (`charcol` ASC);
EXPLAIN SELECT * from tab where hexcol_binary = UNHEX('1010202030304040A0A0B0B0C0C0D0D0') #keylength 16
EXPLAIN SELECT * from tab where charcol = '1010202030304040A0A0B0B0C0C0D0D0' #keylength 97
The lookup on the hexcol_binary column performs much better, especially if it's additionally made unique.
Note: the hex conversion does not care whether the hex characters A through F are capitalized, but charcol comparisons will be sensitive to this.
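A quick way to see this in action: HEX() always emits uppercase digits, so a lowercase literal round-trips through UNHEX()/HEX() into its canonical uppercase form:
SELECT HEX(UNHEX('1010202030304040a0a0b0b0c0c0d0d0'));
-- returns '1010202030304040A0A0B0B0C0C0D0D0'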

MySQL round in query, wrong result

I have a question about a query that I'm running on a MySQL Server (v5.5.50-0+deb8u1).
SELECT 12 - (SELECT qty FROM Table WHERE id = 5213) AS Amount
so the Amount value is 12 - 8.5500000000000007 = 3.4499999999999993
But if I run the query:
SELECT qty FROM Table WHERE id = 5213
it returns 8.55, which is the correct number written in the record, so I was expecting the first query to return 3.45.
The "qty" column in the table "Table" is a DOUBLE.
How is it possible? How can I get the right answer from the query?
Thanks in advance
Well, that's just the way floating-point numbers are.
Floating-point numbers sometimes cause confusion because they are approximate and not stored as exact values. A floating-point value as written in an SQL statement may not be the same as the value represented internally.
This statement holds true for many programming languages as well. Some numbers don't even have an exact representation. Here's something from the Python manual:
The problem is easier to understand at first in base 10. Consider the fraction 1/3. You can approximate that as a base 10 fraction:
0.3 or, better,
0.33 or, better,
0.333 and so on. No matter how many digits you're willing to write down, the result will never be exactly 1/3, but will be an increasingly better approximation of 1/3.
In the same way, no matter how many base 2 digits you're willing to use, the decimal value 0.1 cannot be represented exactly as a base 2 fraction. In base 2, 1/10 is the infinitely repeating fraction 0.0001100110011...
So, in short: a float1 = float2 style of comparison is generally a bad idea, but everyone keeps forgetting that.
You can define the qty column as DECIMAL(10,2).
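A sketch of the fix, assuming two decimal places are enough for qty:
ALTER TABLE `Table` MODIFY qty DECIMAL(10,2);

-- or, if the column cannot be changed, force decimal arithmetic at query time:
SELECT 12 - CAST(qty AS DECIMAL(10,2)) AS Amount FROM `Table` WHERE id = 5213;
-- both give 3.45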

smallest storage of integer array in mysql?

I have a table of user entries, and for every entry I have an array of (2-byte) integers to store (15-25, sporadically even more). The array elements will be written and read all at the same time, it is never needed to update or to access them individually. Their order matters. It makes sense to think of this as an array object.
I have many millions of these user entries and want to store them with the minimum possible amount of disk space. I'm however struggling with MySQL's lack of an array datatype.
I've been considering the following options.
1. Do it the MySQL way: make a table my_data with columns user_id, data_id and data_int. To make this efficient, one needs an index on user_id, totalling well over 10 bytes per integer.
2. Store the array in text format. This takes ~6.5 bytes per integer.
3. Make 35-40 columns ("enough") and let -32768 mean 'empty' (since this value cannot occur in my data). This takes 3.5-4 bytes per integer, but is somewhat ugly (as I have to impose a strict limit on the number of elements in the array).
Is there a better way to do this in MySQL? I know MySQL has an efficient varchar type, so ideally I'd store my 2-byte integers as 2-byte chars in a varchar (or a similar approach with blob), but I'm not sure how to do that. Is this possible? How should this be done?
You could store them as separate SMALLINT NULL columns.
In MyISAM this uses 2 bytes of data + 1 bit of null indicator for each value.
In InnoDB, the null indicators are encoded into the column's field start offset, so they don't take any extra space, and null values are not actually stored in the row data. If the rows are small enough that all the offsets are 1 byte, then this uses 3 bytes for every existing value (1 byte offset, 2 bytes data), and 1 byte for every nonexistent value.
Either of these would be better than using INT with a special value to indicate that it doesn't exist, since that would be 4 bytes of data for every value.
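A sketch of that layout (names are illustrative; one SMALLINT NULL column per array slot, up to the chosen maximum):
CREATE TABLE user_data (
  user_id INT UNSIGNED NOT NULL PRIMARY KEY,
  v01 SMALLINT NULL,
  v02 SMALLINT NULL,
  -- ... continue the pattern up to the maximum array length, e.g. v40
  v40 SMALLINT NULL
);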
See NULL in MySQL (Performance & Storage)
The best answer was given in the comments, so I'll repost it here with some ready-to-use code for further reference.
MySQL has a varbinary type that works really well for this: you can simply use PHP's pack/unpack functions to convert the integers to and from binary form, and store that binary form in the database using varbinary. Example code for the conversion is below.
function pack24bit($n) { // input: 24-bit integer; output: binary string of 3 bytes
    $b3 = $n % 256;               // least significant byte
    $b2 = intdiv($n, 256) % 256;  // middle byte
    $b1 = intdiv($n, 65536);      // most significant byte
    return pack('CCC', $b1, $b2, $b3);
}
function unpack24bit($packed) { // input: binary string of 3 bytes; output: 24-bit int
    $arr = unpack('C3b', $packed); // yields $arr['b1'], $arr['b2'], $arr['b3']
    return 256*(256*$arr['b1'] + $arr['b2']) + $arr['b3'];
}
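On the database side, the packed string can then be stored in a VARBINARY column sized for the maximum array length (a sketch; the names and the 25-element limit are illustrative):
CREATE TABLE user_entries (
  user_id INT UNSIGNED NOT NULL PRIMARY KEY,
  -- 25 elements x 3 bytes each for the 24-bit packing above
  -- (2 bytes each would suffice for true 16-bit values)
  data VARBINARY(75) NOT NULL
);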

What data type could I use for an ID number that has a length of 13 digits in SQL Server 2008?

Normally, the INTEGER data type would suffice, but being in South Africa, the ID numbers have a length of 13 digits and INTEGER only goes up to 10. I am not fond of using character types like VARCHAR, since they would not restrict the input to integer values only. The only solution I see (other than using VARCHAR) is DECIMAL. The problems I see with that are that I can't restrict the max size like in VARCHAR, and the input could contain ',' and '.'. Any comments?
Just use BIGINT, it ranges from -9223372036854775808 to 9223372036854775807 which should be enough for your application.
Assuming that you're referring to South African national ID numbers, which according to Wikipedia always have 13 digits, then I would go for CHAR(13) with a CHECK constraint (a CLR user-defined data type might also be an option).
The main reason is that the 'number' is not a number, it's an ID. You can't add, subtract, multiply etc. the values so there is no benefit in using a numeric data type. Furthermore, the ID is composed of components that have their own meaning, so being able to parse them out is presumably important (and easier when using character data types).
In fact, depending on how you use this data, you could also add columns that store the individual components of the ID (DOB, sequence, citizenship), either as computed columns or real columns. This could be convenient for querying and reporting (and indexing), especially if you converted the DOB to a date or datetime column.
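A sketch of that approach in T-SQL; the table name, constraint name, and computed column are illustrative:
CREATE TABLE Person (
    -- exactly 13 characters, all digits (spaces from padding would fail the check)
    IDNumber CHAR(13) NOT NULL
        CONSTRAINT CK_Person_IDNumber CHECK (IDNumber NOT LIKE '%[^0-9]%'),
    -- the first six digits encode the date of birth (YYMMDD)
    DOBDigits AS SUBSTRING(IDNumber, 1, 6)
);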
I would indeed use VARCHAR with a CHECK that matches the format. You can even be more sophisticated if there is internal validation, e.g. a check digit. Now you are all set for other countries that have an alphabetic character, or if you need to handle a leading zero.
I wouldn't use an integer unless it makes sense to do some sort of arithmetic on the field, which is almost certainly not true here.
You could use money as well, although it appears you only get 4 digits after the decimal place. The money type is 8 bytes, giving you a range from -922,337,203,685,477.5808 to 922,337,203,685,477.5807.
declare @num as money
select @num = '1,300,000.45'
select @num
Results in:
1300000.45
The parsing of commas and periods might depend on your specific culture settings, although I don't know that for sure.