Using char index to find numeric values - mysql

I have a column in a MySQL table that stores mostly numeric values, but sometimes strings. It's defined as VARCHAR(20). There is an index on this column for the first four characters.
ADD INDEX `refNumber` USING BTREE(`refNumber`(4));
Since the field is mostly numeric, it is useful for the user to be able to query for values that fall within a numeric range (e.g., 100 to 2000). If I use a numeric comparison, this index is not used.
WHERE refNumber BETWEEN 100 AND 2000
If I use a string comparison, I get some values I don't want (e.g., 10000 comes back when querying for a range of 100 to 2000).
WHERE refNumber BETWEEN '100' AND '2000'
Is there a good solution to this?
Note: there are some values that are recorded with zeros padded on the front like 0234, which should be returned when looking for values between 100 and 2000.

Three possibilities
1) Separate the numeric values into their own column.
2) If you MUST keep things as they are, decide on a maximum length for the numbers and zero- or blank-pad the numbers to that length.
3) I don't know if MySQL supports function-based indexes, but that might be an option. If so, write a function that returns the extracted numeric value and use that as the basis of the index.
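To illustrate option 2 outside the database, here is a minimal Python sketch of why fixed-width zero padding makes plain string comparison agree with numeric order. The sample values and the WIDTH constant are assumptions made for the example, not part of the original schema.

```python
# Hypothetical sample values; WIDTH is an assumed maximum number of digits.
WIDTH = 8
values = ["100", "0234", "500", "2000", "10000"]


def pad(ref: str) -> str:
    """Left-pad a numeric string with zeros to a fixed width."""
    return ref.zfill(WIDTH)


low, high = pad("100"), pad("2000")

# Plain string comparison on the padded form now agrees with numeric order.
in_range = [v for v in values if low <= pad(v) <= high]
print(in_range)  # ['100', '0234', '500', '2000']
```

Non-numeric strings would need their own rule (for example, leaving them unpadded), which this sketch does not cover.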

You can try using the string comparison first, so the index is used, and still do the numeric comparison afterwards. It shouldn't slow things too much, since the second filter will only apply to a small subset of the rows.
WHERE refNumber BETWEEN '100' AND '2000' AND CAST(refNumber AS SIGNED INTEGER) BETWEEN 100 AND 2000
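One caveat is worth checking before relying on this: the string range acts as a pre-filter, so any value it rejects never reaches the numeric test. The Python sketch below uses Python's string comparison as a stand-in for MySQL's (for plain digit strings the ordering should match, but treat that as an assumption for your collation) and hypothetical sample values.

```python
# Hypothetical sample values, including a zero-padded one and a non-numeric one.
values = ["150", "500", "0234", "10000", "abc"]

# Step 1: the string range, i.e. the part that could use the prefix index.
string_pass = [v for v in values if "100" <= v <= "2000"]

# Step 2: the numeric range the user actually wants.
numeric_pass = [v for v in values if v.isdigit() and 100 <= int(v) <= 2000]

# Rows returned when both conditions are ANDed together.
both = [v for v in string_pass if v in numeric_pass]

print(string_pass)   # ['150', '10000']
print(numeric_pass)  # ['150', '500', '0234']
print(both)          # ['150']
```

In this model '500' and '0234' are lost because they sort outside the string range '100'..'2000', so the combined condition removes the unwanted '10000' but also drops rows that belong in the result.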

Related

What are the downsides of making a numeric column a string in a database?

I have members on my team who are tossing around the idea to make every column in the database a string including numeric columns.
I know that sorting becomes an issue.
What are the other downsides of making a numeric column a string?
The major issue is that users can put broken data into the columns -- data that is not numeric. That is simply not possible with the correct type. Although you could add a check constraint for every numeric column, that seems like a lot of work.
The scenario is: You have a query that works and has worked for a long time. All of a sudden, someone puts a non-numeric value into the column. The query breaks. And because the query was (probably) using implicit conversion, it is really hard to tell where the problem is.
Let me just say: I am speaking from experience here.
Other problems are:
Comparisons don't work as expected: '0' <> '0.0'.
Comparisons don't work as expected: '9' > '100'.
Comparisons don't work as expected: '.1' < '0.01'.
Sorting doesn't work as expected.
The code is filled with (unnecessary and typically implicit) conversions.
Some databases, such as SQL Server, overload operators so '1' + '1' <> '2'.
Some databases overload operators, so current_timestamp + 1 is valid but current_timestamp + '1' is invalid.
A comparison in a query can affect index usage. So, strcol = 1 ends up converting strcol to a number, which typically precludes the use of an index. On the other hand, intcol = '1' ends up converting the constant to a number, which still allows the index to be used. I do not recommend mixing types in comparisons, though.
Space is a wash, because in many cases the string representation might be smaller than the number representation. It depends in that case. There is a slight hit on indexing, because fixed length keys are usually more efficient.
If you mix types, things get worse -- because that affects the optimizer.
Some things that are composed of numbers are not necessarily numeric. You can usually tell the difference easily: does it make sense to perform arithmetic operations on the value? Or another indicator: do leading zeros make sense?
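The comparison problems listed above are easy to reproduce. Here is a small Python sketch that compares the same pairs once as text and once as numbers; Python's string ordering stands in for the database's, which is an assumption that holds for these simple ASCII values.

```python
from decimal import Decimal


def symbol(a, b) -> str:
    """Return '<', '=' or '>' describing how a compares to b."""
    if a < b:
        return "<"
    if a > b:
        return ">"
    return "="


pairs = [("0", "0.0"), ("9", "100"), (".1", "0.01")]

# For each pair: (result as text, result as numbers).
results = {
    (a, b): (symbol(a, b), symbol(Decimal(a), Decimal(b)))
    for a, b in pairs
}

for (a, b), (as_text, as_number) in results.items():
    print(f"{a!r} vs {b!r}: text {as_text}, numeric {as_number}")

# Sorting strings puts "10" before "2".
print(sorted(["2", "10", "9"]))  # ['10', '2', '9']
```

Every pair gives a different answer depending on the type, which is the kind of silent disagreement the list describes.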
It will take more space.
Indexes will also take more space and be less efficient.
Ordering will not work correctly (e.g. "10" < "2").
Any numeric operations will not work correctly (e.g. 10% more than x).
Having said all this, fields like SSN, phone number, etc. that appear numeric but are not really numbers should be strings.
In general, if the numeric column is an ID and never used for calculations, it is probably OK. If the numbers are "measures", like amount or quantity, I would not recommend it as you most likely would want to do calculations at some point (like SUM, AVG, etc)
I ran into this kind of issue with an externally designed database and faced a lot of challenges:
Date and numeric columns had to be converted during querying.
Indexes took more space and performed more slowly.

Impact on MySQL SELECT query if we search a string with length greater than column's length

I have a column "name" of type VARCHAR with a length of 80.
If I search for a string longer than 80 characters, how does the SELECT operation behave?
Apart from the round trip to the DB, will it scan the whole table, or will it return immediately because the length is already greater than 80?
SELECT * FROM TABLE WHERE name = "longgg... ...string" ;
I need to know this because the existing codebase has a variable that is used in all layers of the MVC. This variable is meant for a different column with a different length. To save time and avoid redundant code, I want to reuse that variable, which is validated against a length greater than 80.
After the discussion, I am going to add length validation instead of depending on DB validation, since all the rows are scanned!
One more wise thought from the comments: if the column is indexed, the whole table is not scanned. I verified this using EXPLAIN.
All rows are scanned because there is no index. A CHAR/VARCHAR index stores the length as well (it is included in key_len in the EXPLAIN output). So if an index is created on the column, the query should not need to compare anything: based on the index, MySQL should be able to tell that no rows match the criteria.
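The application-side length check mentioned above could look roughly like this Python sketch. The function name and the constant are hypothetical, and stripping trailing spaces first is an assumption: under some collations trailing spaces are ignored in comparisons, so a padded search term might still match.

```python
# Assumed to mirror the VARCHAR(80) definition of the "name" column.
MAX_NAME_LENGTH = 80


def can_match_name(search_term: str) -> bool:
    """Return False when the term is too long to equal any stored name.

    Trailing spaces are stripped first because some collations ignore
    them when comparing strings (an assumption about the setup).
    """
    return len(search_term.rstrip(" ")) <= MAX_NAME_LENGTH


print(can_match_name("alice"))    # True
print(can_match_name("x" * 100))  # False
```

The caller would skip the query entirely when this returns False, which avoids both the round trip and any scan.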
I don't have a documentation reference for this, but from what I have observed in testing, MySQL casts both sides of the equality expression to the length of the string literal on the RHS. That is, MySQL will effectively execute the following query:
SELECT *
FROM yourTable
WHERE CAST(name AS CHAR(100)) = CAST('longgg... ...string' AS CHAR(100));
-- assuming it has a length of 100
This comparison would generally fail, since MySQL will not pad a string on the LHS which is less than 100 characters, meaning that the lengths would not generally even match.

Incorrect decimals appearing in SUM MySQL

I have the following SQL query.
SELECT SUM(final_insurance_total) as total
FROM `leads`
GROUP BY leads.status
I have a single row of data in the leads table with a value for final_insurance_total of 458796. The data type for final_insurance_total is float.
For some reason, MySQL is summing a single row as "458796.375".
If I change the query to
SELECT (final_insurance_total) as total
FROM `leads`
GROUP BY leads.status
the correct value is returned. What in the world is going on?
The FLOAT and DOUBLE types in MySQL (as well as in other databases and programming language runtimes) are represented in a special way, which means the stored values are approximations, not exact values. See the MySQL docs, as well as general information on floating-point arithmetic.
In order to store and operate with exact values, use the type DECIMAL (see https://dev.mysql.com/doc/refman/5.1/en/precision-math-decimal-characteristics.html).
EDIT: I have run some tests, and while floating-point precision errors are quite common, this particular one looks to be specific to the implementation of SUM() in MySQL. In other words, it is a bug that has been there for a long time. In any case, you should use DECIMAL as your field type.
FLOAT does not guarantee precision where any calculation is made. If you use a simple SELECT, no calculation is made, so you get the original value. But if you use SUM(), even with one row, at least one addition is executed (0 + current_value).
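To see how a single-precision value can carry digits nobody typed, here is a Python sketch that round-trips numbers through a 4-byte float, the same width as MySQL's FLOAT. The input 458796.37 is purely hypothetical (the question only shows 458796), so this demonstrates the mechanism and is not a diagnosis of the asker's data.

```python
import struct
from decimal import Decimal


def to_float32(x: float) -> float:
    """Round-trip a Python float through a 4-byte IEEE 754 single."""
    return struct.unpack("f", struct.pack("f", x))[0]


# A whole number of this size fits exactly in single precision.
exact = to_float32(458796.0)

# A hypothetical fractional input is rounded to the nearest 1/32.
approximated = to_float32(458796.37)

# A decimal type keeps the digits exactly as written.
as_decimal = Decimal("458796.37")

print(exact)         # 458796.0
print(approximated)  # 458796.375
print(as_decimal)    # 458796.37
```

At this magnitude a single-precision float can only move in steps of 1/32, so any fractional part is snapped to the nearest such step.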
Do you really need FLOAT? For example, if you have 2 decimal digits, you could use INT and multiply all values by 100 before all INSERTs. When SELECTing results, you will divide by 100.
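The scaled-integer idea can be sketched in Python as follows. The amounts are made up for illustration, and two decimal digits are assumed, as in the suggestion above.

```python
from decimal import Decimal

# Hypothetical amounts with two decimal digits.
amounts = ["0.10", "0.20"]

# Summing binary floats accumulates representation error.
float_total = sum(float(a) for a in amounts)

# Scale by 100 before storing, sum as integers, divide when displaying.
cents = [int(Decimal(a) * 100) for a in amounts]
int_total = sum(cents)

print(float_total)      # 0.30000000000000004
print(int_total)        # 30
print(int_total / 100)  # 0.3
```

The integers are summed exactly; the only place a float appears is the final division for display.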
If you are not a sysadmin and cannot change the column's data type away from FLOAT, you can use CAST to produce the desired output.

MySQL CAST() causes significant performance hit

So I ran the following in the MySQL console as a control test to see what was holding back the speed of my query.
SELECT bbva_deductions.ded_code, SUBSTRING_INDEX(bbva_deductions.employee_id, '-' , -1) AS tt_emplid,
bbva_job.paygroup, bbva_job.file_nbr, bbva_deductions.ded_amount
FROM bbva_deductions
LEFT JOIN bbva_job
ON CAST(SUBSTRING_INDEX(bbva_deductions.employee_id, '-' , -1) AS UNSIGNED) = bbva_job.emplid LIMIT 500
It took consistently around 4 seconds to run. (seems very high for only 500 rows). Simply removing the CAST within the JOIN decreased that to just 0.01 seconds.
In this context, why on earth is CAST so slow?
[The EXPLAIN output for this query, the EXPLAIN output for the query without the CAST, and the EXPLAIN EXTENDED output are not preserved here.]
As documented under How MySQL Uses Indexes:
MySQL uses indexes for these operations:
[ deletia ]
To retrieve rows from other tables when performing joins. MySQL can use indexes on columns more efficiently if they are declared as the same type and size. In this context, VARCHAR and CHAR are considered the same if they are declared as the same size. For example, VARCHAR(10) and CHAR(10) are the same size, but VARCHAR(10) and CHAR(15) are not.
Comparison of dissimilar columns may prevent use of indexes if values cannot be compared directly without conversion. Suppose that a numeric column is compared to a string column. For a given value such as 1 in the numeric column, it might compare equal to any number of values in the string column such as '1', ' 1', '00001', or '01.e1'. This rules out use of any indexes for the string column.
In your case, you are attempting to join on a comparison between a substring (of a string column in one table) and a string column in another table. An index can be used for this operation, however the comparison is performed lexicographically (i.e. treating the operands as strings, even if they represent numbers).
By explicitly casting one side to an integer, the comparison is performed numerically (as desired) - but this requires MySQL to implicitly convert the type of the string column and therefore it is unable to use that column's index.
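A toy model may help show why a numeric comparison cannot use a string-ordered index. In the Python sketch below, a sorted list stands in for the index on a string column; the sample values are hypothetical, and Python's ordering is only an approximation of a real collation.

```python
# Hypothetical contents of a string column, chosen for illustration.
column_values = ["1", " 1", "00001", "1.0", "+1", "2", "10", "0.5"]

# A B-tree index on a string column keeps entries in string order.
index_order = sorted(column_values)

# Positions in that order whose numeric value equals 1.
positions = [i for i, v in enumerate(index_order) if float(v) == 1.0]

print(index_order)  # [' 1', '+1', '0.5', '00001', '1', '1.0', '10', '2']
print(positions)    # [0, 1, 3, 4, 5]
```

The matching entries are not contiguous (position 2 holds '0.5'), so there is no single index range to seek into. Every entry has to be converted and checked, which is effectively a scan.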
You have hit this road bump because your schema is poorly designed. You should strive to ensure that all columns:
are encoded using the data types that are most relevant to their content; and
contain only a single piece of information — see Is storing a delimited list in a database column really that bad?
At the very least, your bbva_job.emplid should be an integer; and your bbva_deductions.employee_id should be split so that its parts are stored in separate (appropriately-typed) columns. With appropriate indexes, your query will then be considerably more performant.

MySQL datatypes for business applications?

Good day, I am confused with the datatype for MySQL.
I am using decimal as apparently it is the safest bet for accuracy in a business application. However, I find that when fields are returned I have values of 999999999.99, where my datatype is DECIMAL(10,2). So the actual value has overflowed outside the (10, 2) parameter.
Would it be correct that even though I have specified 10 places before the comma and 2 places after the comma, MySQL still stores the complete number?
Also would it be possible to turn off the maximum amount of digits displayed before and after the comma?
Would it be correct that even though I have specified 10 places before the comma and 2 places after the comma, MySQL still stores the complete number?
No, it wouldn't.
First, you specified 10 digits altogether; two are to the right of the decimal point, and eight are to the left.
The MySQL documentation illustrates this with a column declared as salary DECIMAL(5,2): "Standard SQL requires that DECIMAL(5,2) be able to store any value with five digits and two decimals, so values that can be stored in the salary column range from -999.99 to 999.99."
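The same arithmetic works for any precision and scale. As a quick sketch (the helper name is made up for this example), the range implied by DECIMAL(M,D) can be computed in Python like this:

```python
from decimal import Decimal


def decimal_range(precision: int, scale: int):
    """Return (min, max) for a DECIMAL(precision, scale) column.

    precision is the total number of digits; scale is how many of
    them sit to the right of the decimal point.
    """
    largest = Decimal(10**precision - 1).scaleb(-scale)
    return -largest, largest


print(decimal_range(5, 2))   # (Decimal('-999.99'), Decimal('999.99'))
print(decimal_range(10, 2))  # (Decimal('-99999999.99'), Decimal('99999999.99'))
```

So DECIMAL(10,2) leaves eight digits to the left of the decimal point, not ten.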
Second, MySQL will silently convert the least significant digits to scale if there are more than two. That will probably look like MySQL truncates, but the actual behavior is platform-dependent. It will raise an error if you supply too many of the most significant digits.
Finally, when you're working with databases, the number of digits displayed has little to do with what a data type is or with what range of values it stores.