MySQL character ordering: numbers before question mark

I have recently upgraded a MySQL data store from some ungodly many-years-out-of-date version to 8.0.26.
In one particular table I store dates associated with each record, but occasionally there are as-yet-unknown future dates. These have always been stored in the format YYYY-MM-??, so the field type is VARCHAR(10) rather than DATE, as would be expected if it was possible to always be exact. The field data is otherwise reliably YYYY-MM-DD.
However, queries to order this data have recently stopped working as expected, with MySQL reckoning that such an unknown date should be ordered BEFORE an exact date.
A query boils down to something like this: SELECT * FROM table WHERE date_field <= CURDATE()
(Today is 3rd December, so CURDATE is evaluating as 2021-12-03. The same occurs when using the literal string value 2021-12-03 rather than the CURDATE function, so it's definitely a sorting issue rather than clash between data types.)
In those old MySQL versions previously running, 2021-12-?? would evaluate higher/greater than an exact date like 03, and thus not be returned. This would also be expected in line with ASCII sort ordering. Now, however, any such ?? records are also returned, the question mark character apparently being sorted as before/less than a digit.
For the moment I can force the correct and expected behaviour by utilising REPLACE in my query, but this is process-heavy, ugly and inconvenient: SELECT * FROM table WHERE REPLACE(date_field , '??', '99') <= CURDATE()
Can anyone shed some light on why this is occurring and how I might correct it? It is presumably a MySQL bug, given the standard ASCII ordering and the previous experience (of many years standing) of it working correctly?
EDIT: Thanks to the initial replies pointing me to collation. The database uses almost entirely plain English with only occasional accents (etc), so I've rarely had to touch the default settings in the past.
As per ProGu and Álvaro González's responses, I've begun digging around and test queries without the real table/database involved do indeed return as suggested. However, as soon as I attempt to run anything on the real table, it's still not behaving as expected.
The table is on InnoDB, and all tables and (textual) fields across the database are utf8mb4/utf8mb4_0900_ai_ci. I have tried forcing the collation both at query level and by changing the actual table and field collation, yet that pesky 2021-12-?? is always returned, no matter which I choose. I have attempted various query formats to no avail:
SELECT * FROM table WHERE date_field <= CURDATE() ORDER BY date_field COLLATE utf8mb4_0900_ai_ci DESC
SELECT * FROM table WHERE date_field COLLATE utf8mb4_0900_ai_ci <= CURDATE() COLLATE utf8mb4_0900_ai_ci ORDER BY date_field COLLATE utf8mb4_0900_ai_ci DESC
Test based on Álvaro's code, correctly returning 2021-12-03:
with sample_data (sample_value) as (
select '2021-12-??'
union all select '2021-12-03'
)
select *
from sample_data
where sample_value <= CURDATE()
order by sample_value COLLATE utf8mb4_0900_ai_ci DESC LIMIT 1;
Is my collation inexperience showing; have I missed something really obvious?
EDIT 2
All tables and all text columns (plus connection) are already set to utf8mb4 and utf8mb4_0900_ai_ci.
See this DB Fiddle, which also incorrectly returns 2021-12-?? as apparently less than the comparison value (current date, 2021-12-08). I can find no collation that returns the real smaller value (2021-10-31 in the sample data).
Going back to Rick's initial reply:
SELECT "2021-12-??" < "2021-12-03"
This returns 1, i.e. that 03 IS greater than ??. Why? ASCII ordering is clear that digit characters come before - less than - the question mark character.
As in the original version of my post, it seems to me that MySQL is getting character ordering wrong when it is using digits as string rather than int.
Compare:
SELECT "?" < "0"; = 1
SELECT "?" < 0; = 0

This is a collation issue. You're probably relying on the default collation and that has changed.
You can change the collation at query level to figure out which ones suits your need and then adjust the table or column collation accordingly:
with sample_data (sample_value) as (
select '?'
union all select '0'
)
select *
from sample_data
order by sample_value COLLATE utf8mb4_bin;
Result
0
?
with sample_data (sample_value) as (
select '?'
union all select '0'
)
select *
from sample_data
order by sample_value COLLATE utf8mb4_0900_as_cs;
Result
?
0
Please note I mean collation and not encoding. You should be able to keep your current encoding if it isn't UTF-8.
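To see the two orderings from the snippets above outside MySQL, here is a rough Python sketch. It is not MySQL itself: `sorted()` on plain strings compares code points, which is how a binary collation such as utf8mb4_bin behaves, while `uca_like_key` is a hypothetical stand-in that simply ranks punctuation ahead of digits, as the utf8mb4_0900_* result shows. It does not use the real UCA weight tables.

```python
# Sketch only: plain Python comparisons, not MySQL.
samples = ["2021-12-??", "2021-12-03"]

# Code-point order, like a binary collation (utf8mb4_bin):
# '0' is U+0030 and '?' is U+003F, so the digit sorts first.
binary_order = sorted(samples)


def uca_like_key(text):
    """Hypothetical key: rank punctuation (0) ahead of digits and letters (1)."""
    return [(1 if ch.isalnum() else 0, ch) for ch in text]


# Punctuation-first order, mimicking what the utf8mb4_0900_* collations return.
uca_like_order = sorted(samples, key=uca_like_key)

print(binary_order)    # exact date first
print(uca_like_order)  # '??' date first
```

With the binary-style ordering the `??` date sorts after the exact date, which is the behaviour the question expects.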
Edit #1: these snippets are only a tool to decide which collation to choose. The fix is not to add a random ORDER BY clause at the end of your query; the fix is to change the table (or column) collation.
Edit #2:
where sample_value <= CURDATE() seemingly ignores table collation, but that's probably due to automatic casting from date type. If you force a cast to text things change:
where sample_value <= cast(curdate() as char(10))
My advice is that you first set a known-good default collation everywhere (tables, connection...). It's possible that this will fix all the issues.

I think you will need to change the code a little. It is unclear which of the cases below apply to your code.
mysql> SELECT "2021-12-??" < "2021-12-03";
+-----------------------------+
| "2021-12-??" < "2021-12-03" |
+-----------------------------+
| 1 |
+-----------------------------+
mysql> SELECT "2021-12-??" < CURDATE();
ERROR 1525 (HY000): Incorrect DATE value: '2021-12-??'
mysql> SELECT "2021-12-??" < CAST(CURDATE() AS CHAR);
+----------------------------------------+
| "2021-12-??" < CAST(CURDATE() AS CHAR) |
+----------------------------------------+
| 1 |
+----------------------------------------+
mysql> SELECT DATE(NOW()) <= CAST(CURDATE() AS CHAR);
+----------------------------------------+
| DATE(NOW()) <= CAST(CURDATE() AS CHAR) |
+----------------------------------------+
| 1 |
+----------------------------------------+

Related

MySQL MAX and MIN on Varchar

I have a table that contains 3 columns; day_id, start_date, end_date. start_date and end_date are varchar(8) in a format like this HH:II:SS. Sometimes dates can go over 24h in order to represent that something happened day after, for example: 25:20:01 is 01:20:01 but in a new day. day_id is not unique, it repeats. I need to get first and last event of a day, and this is my code:
SELECT day_id,
MIN(start_date) as start_time,
MAX(end_date) as end_date
FROM events WHERE day_id IN ('day_1', 'day_2', 'day_3')
GROUP BY day_id ORDER BY start_time ASC
It works as intended but I can't figure out why, how does MySQL know that 25:01:45 is larger than 20:21:09 since they are both varchars? The whole table is in utf8mb4_0900_ai_ci collation, running on MySQL server version 8.
It is a string comparison, and it compares characters by their ASCII values, as you know. It mainly works because every part of the time is written with two digits, whether its value has one digit or two. For example:
1:20:1 is represented as 01:20:01
2:5:7 is represented as 02:05:07
So 10:02:07 will never sort before 2:5:7 (as it would if unpadded, since 1 < 2), because 2:5:7 is stored as 02:05:07 and 1 > 0. Hence, it always works.
Sometimes dates can go over 24h in order to represent that something happened day after, for example: 25:20:01 is 01:20:01
So, if this 25 goes over 2 digits for some reason, then you will have problems. So, use the correct datatype to store it - TIME.
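A small Python sketch of the same point, using ordinary string comparison as a stand-in for a left-to-right collation. `pad_time` is a hypothetical helper written for this illustration, not something from the question.

```python
def pad_time(value):
    """Zero-pad each H:M:S part to two digits (hypothetical helper)."""
    return ":".join(part.zfill(2) for part in value.split(":"))


# Unpadded, the string order is wrong: '1' < '2', so 10:02:07 sorts first.
unpadded_wrong = "10:02:07" < "2:5:7"

# Padded to two digits, string order matches chronological order.
padded_right = pad_time("2:5:7") < "10:02:07"

# A three-digit hour breaks it again: '1' < '2', so 100:00:00 sorts before 25:00:00.
overflow_wrong = "100:00:00" < "25:00:00"

print(pad_time("2:5:7"), unpadded_wrong, padded_right, overflow_wrong)
```

All three comparisons evaluate to true, and only the padded one agrees with the real chronological order.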
how does MySQL know that 25:01:45 is larger than 20:21:09?
Databases compare strings using a collation. With the usual collations, strings are compared character by character from the left, as in alphabetical ordering.
So MySQL knows that '25' > '20' in exactly the same way that we know that the word 'BE' comes after 'BA' in the dictionary.

DATE type comparison in MySQL

I just hit a rather unintuitive case with MySQL.
The query contains a WHERE clause with the comparison WHERE t.date = '2016-12-31' (the datatype of t.date is DATE!), and it returns no records. But the query with WHERE t.date > '2016-12-31' returns the records with date equal to '2016-12-31' among other records! The record for 2016-12-31 also showed up when I used BETWEEN '20161231' AND '20170101'. I tried different formats and type changes; nothing helped. After some time spent searching for the cause, I updated the record's date column manually, setting it to '2016-12-31'. After that, WHERE t.date = '2016-12-31' started to work as expected.
Probably I'm missing something, wondering what can cause such behavior.
Update
date is DATE, not DATETIME
After doing manual update I can't reproduce the mentioned behavior again: now any type of comparison(=, DATE(..)=, STRCMP) - works as it should!
Update 2
For 2016-11-30 and 2016-09-30(end of months!) found the same behavior! Won't update the record manually for now to test the suggestions I get here.
Update 3
I've also run OPTIMIZE TABLE on the table with that date column to rebuild the indexes and eliminate any problems with corruption.
Update 4
Here is more:
if I check HEX values for the date field for incorrect fields(end of month) I get wrong values!
SELECT HEX(t.date) FROM table t WHERE t.date BETWEEN DATE('20160930') AND DATE('20161001');
Returns:
323031362D31302D3030
323031362D31302D3031
SELECT HEX(DATE('20160930'));
Returns:
323031362D30392D3330
And 323031362D30392D3330 != 323031362D31302D3030
SELECT X'323031362D31302D3030';
And it returns:
2016-10-00, NOT 2016-09-30!
For the value that I've updated manually - HEX is same.
But what can cause such difference?
Try forcing the format using
WHERE date(t.date) = '2016-12-31'
or
WHERE date(t.date) = str_to_date( '2016-12-31', '%Y-%m-%d')
or based on your test
WHERE date(t.date) = str_to_date( '20161231', '%Y%m%d')
After some investigation I've found the problem, and it's not directly related to date comparison in MySQL. I'll post it here in case anyone else gets stuck on something similar.
I found that the problem was in how the IDE (DataGrip, in my case) displayed the results: the value of the date field in the database was 2016-10-00, but the select showed 2016-09-30! That was confusing. Once the 00 day was found, the cause was relatively easy to track down: CURDATE() - 1 (where it should have been CURDATE() - INTERVAL 1 DAY). Don't do date arithmetic without the dedicated syntax such as INTERVAL!
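As an illustration of why CURDATE() - 1 can produce a day of 00, here is a Python sketch that mimics the two kinds of arithmetic. It assumes the expression was evaluated on 2016-10-01, which is consistent with the 2016-10-00 value found above. In a numeric context MySQL treats the date as the number YYYYMMDD, so the subtraction is ordinary integer arithmetic.

```python
from datetime import date, timedelta

today = date(2016, 10, 1)  # assumed evaluation date for this illustration

# Analogue of CURDATE() - 1: the date becomes the integer 20161001,
# and subtracting 1 gives 20161000, i.e. "2016-10-00".
as_number = int(today.strftime("%Y%m%d")) - 1

# Analogue of CURDATE() - INTERVAL 1 DAY: real calendar arithmetic.
yesterday = today - timedelta(days=1)

print(as_number)  # integer with a "00" day part
print(yesterday)  # last day of the previous month
```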
Thanks to everyone who supported the question, sorry for confusion, I was confused too and found the answer only after several steps.

SQL "BETWEEN" request not acting as I want

I'm having some issues with my SQL request.
Here's the request :
SELECT SUBSTR(`Date`, 1, 11) AS `format_date`
FROM table
HAVING `format_date` BETWEEN '07/06/2016' AND '16/06/2016'
When I run the request I get unwanted results like "07/11/2014". After doing some tests, it looks like the request only takes the day into consideration, but I can't figure out why. Any ideas?
07/11/2014 is between the two given strings. You are comparing strings, not dates, so you are getting exactly what you asked for.
This is what you compare:
"07/0" < "07/1" < "16/0"
Try comparing actual dates, or format your string so that you can use them (YYYY/MM/DD).
This question (and its accepted answer) should help you convert your strings to useable dates:
how-to-convert-a-string-to-date-in-mysql
You can then compare real dates with each other.
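The difference between the two comparisons can be sketched in Python, using the values from the question. This is only an illustration of string versus date comparison, not the MySQL query itself. `parse` is a hypothetical helper.

```python
from datetime import datetime

low, high, row = "07/06/2016", "16/06/2016", "07/11/2014"

# String comparison: decided at the first differing character,
# "07/0" < "07/1" and "0" < "1", so the 2014 row counts as in range.
string_match = low <= row <= high


def parse(text):
    """Parse a dd/mm/yyyy string into a date (hypothetical helper)."""
    return datetime.strptime(text, "%d/%m/%Y").date()


# Date comparison: the 2014 row is correctly excluded.
date_match = parse(low) <= parse(row) <= parse(high)

print(string_match, date_match)
```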
The reason you get the wrong result is already clear; you can try the following:
SELECT SUBSTR(`Date`, 1, 11) AS `format_date`
FROM table
HAVING STR_TO_DATE(`format_date`, '%d/%m/%Y') BETWEEN STR_TO_DATE('07/06/2016', '%d/%m/%Y') AND STR_TO_DATE('16/06/2016', '%d/%m/%Y')
And I think you should change HAVING to WHERE.
Your problem can be summed up in the following expression...
SELECT '8' BETWEEN '7' AND '77';
+--------------------------+
| '8' BETWEEN '7' AND '77' |
+--------------------------+
| 0 |
+--------------------------+
There are 3 problems to address.
The simplest is that you have wrongly used HAVING instead of WHERE.
You have converted dates into strings in the form dd/mm/yyyy. The problem here is that text (strings) does NOT behave like numbers, e.g. as text, 1 and 10 are sorted before 2 or 3. Because of this, you are asking for TEXT between STRINGS THAT LOOK LIKE DATES, but it will not behave that way because that requires handling the dates as dates, not as text.
Between is a very poor way to handle date ranges. A far more reliable way is to use a combination of >= and < as seen below.
Example
SELECT SUBSTR(Date, 1, 11) AS `format_date`
FROM table
WHERE `date` >= '2016-06-07' AND `date` < '2016-06-17'

Need to combine timestamp field date with varchar field time

I'm trying to figure out a way to fix a database schema issue.
In column 1 a y-m-d H:i:s date is stored (timestamp field)
col1 = 2009-11-12 00:00:00
In column 2 a time is stored (varchar)
col2 = 15:48
I'm thinking that storing it in one column would be more efficient than separately, so I'm trying to make column 3 a datetime field
col3 = 2009-11-12 15:48:00
Unless keeping it original is fine.
Yes, definitely use one field, you can get just the date or time from it later if you need. I believe you can run the following query to update col3 with the correct datetimes.
UPDATE tablename
SET col3 = CAST(CONCAT(LEFT(col1, 10), ' ', col2, ':00') AS DATETIME)
If you don't have anything accessing these old fields (col1 and col2), you should get rid of them for clarity. If you do, it is going to be tricky deciding whether or not to maintain two fields for the same data.
Addtime should do what you need
mysql> select addtime('2012-05-05 00:00:00', '11:12');
+-----------------------------------------+
| addtime('2012-05-05 00:00:00', '11:12') |
+-----------------------------------------+
| 2012-05-05 11:12:00 |
+-----------------------------------------+
1 row in set (0.00 sec)
Contrary to the other answers... I wouldn't immediately suggest combining these columns.
Consider how the columns are going to be queried. In my experience, efficient queries are more important than disk space efficiency, so if you want to select rows based on date (ignoring time) and/or time (ignoring date), you would want these in separate columns. Whilst you can get the date from a datetime column, if you have lots of rows, doing that on each row before running a query would be really inefficient. (For example... consider this SO question)

MySQL compare date with timestamp

I have a VARCHAR field completion_date that contains a timestamp date (ex 1319193919). Is it possible to run a query on such a field that compares it with NOW() ? I'm trying:
SELECT * FROM (`surveys`) WHERE `active` = '1' AND `archived` = '0' AND `completion_date` > 'NOW()'
But the results are not really what I'm expecting. Is this because of the VARCHAR? If so, what kind of date field am I better off using? The data must remain a Linux timestamp.
Convert NOW() to a timestamp using UNIX_TIMESTAMP()
SELECT *
FROM (`surveys`)
WHERE `active` = '1' AND `archived` = '0' AND `completion_date` > UNIX_TIMESTAMP(NOW())
Also, remove the quotes you had around 'NOW()'
mysql> SELECT UNIX_TIMESTAMP(NOW());
+-----------------------+
| UNIX_TIMESTAMP(NOW()) |
+-----------------------+
| 1319288622 |
+-----------------------+
N.B. In case you need it, the inverse of this function is FROM_UNIXTIME() to convert a timestamp into the default MySQL DATETIME format.
As mentioned in comments below, if you have the ability to make the change, it is recommended to use a real DATETIME type instead of VARCHAR() for this data.
A Linux timestamp can easily be stored in a BIGINT (or an UNSIGNED INT), which would make the type of comparisons you're trying to do possible. Comparing a VARCHAR against another string is a lexical, not numeric, comparison, which is NOT what you want. Using a BIGINT in conjunction with converting NOW() with UNIX_TIMESTAMP() should get you what you want.
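To illustrate lexical versus numeric comparison, here is a Python sketch with two timestamps held as strings. The 9-digit value is a made-up example from 2001; the 10-digit one is the UNIX_TIMESTAMP(NOW()) output shown earlier. This only demonstrates string-to-string comparison and is not a claim about how MySQL casts in every expression.

```python
stored = "999999999"   # hypothetical 9-digit timestamp (September 2001)
now = "1319288622"     # the 10-digit value from the example output

# Lexical: decided by the first character, '9' > '1', so the older
# timestamp wrongly compares as greater.
lexical = stored > now

# Numeric: the older timestamp is correctly smaller.
numeric = int(stored) > int(now)

print(lexical, numeric)
```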
It might even be better to store it using a DATETIME data type and do the conversion when you select the data. Storing it as a DATETIME future proofs your application in the event that you move to or add a different platform where a Linux timestamp isn't appropriate. Then you only need to modify your code, not convert your data to have it continue to work. I'm working on a project now where dates were stored as character data and it's been no end of problems getting the old data into shape to use with the new application, though you might experience fewer problems than us because you're storing a time stamp, not a formatted date, with its attendant typos.