MySQL SELECT returns wrong results

I'm working with MySQL 5.7. I created a table with a virtual column (not stored) of type DATETIME with an index on it. While I was working on it, I noticed that order by was not returning all the data (some data I was expecting at the top was missing). Also the results from MAX and MIN were wrong.
After I ran
ANALYZE TABLE
CHECK TABLE
OPTIMIZE TABLE
the results were correct. I guess there was an issue with the index data, so I have a few questions:
When and why could this happen?
Is there a way to prevent this?
Among the three commands I ran, which is the correct one to use?
I'm worried that this could happen again in the future without my noticing.
EDIT:
as requested in the comments I added the table definition:
CREATE TABLE `items` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`user_id` bigint(20) unsigned DEFAULT NULL,
`image` json DEFAULT NULL,
`status` json DEFAULT NULL,
`status_expired` tinyint(1) GENERATED ALWAYS AS (ifnull(json_contains(`status`,'true','$.expired'),false)) VIRTUAL COMMENT 'used for index: it checks if status contains expired=true',
`lifetime` tinyint(4) NOT NULL,
`expiration` datetime GENERATED ALWAYS AS ((`create_date` + interval `lifetime` day)) VIRTUAL,
`last_update` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`create_date` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `user_id` (`user_id`),
KEY `expiration` (`status_expired`,`expiration`) USING BTREE,
CONSTRAINT `ts_competition_item_ibfk_2` FOREIGN KEY (`user_id`) REFERENCES `ts_user_core` (`user_id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=1312459 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=COMPRESSED
Queries that were returning the wrong results:
SELECT * FROM items ORDER BY expiration DESC;
SELECT max(expiration),min(expiration) FROM items;
Thanks

TL;DR
The trouble is that your data comes from virtual columns materialized via indexes. The CHECK, OPTIMIZE, and ANALYZE operations you ran force the indexes to be synced and fix any errors. That gives you the correct results from then on, at least until the index gets out of sync again.
Why it may happen
Many of the problems are caused by your table design. Let's start with:
`status_expired` tinyint(1) GENERATED ALWAYS AS (ifnull(json_contains(`status`,'true','$.expired'),false)) VIRTUAL
No doubt this was created to overcome the fact that you cannot directly index a JSON column in MySQL. You have created a virtual column and indexed that instead. That's all very well, but this column can hold only one of two values: true or false. That means it has very poor cardinality. As a result, MySQL is unlikely to use this index for anything.
But we can see that you have combined the status_expired column with the expiration column when creating the index, perhaps with the idea of overcoming the poor cardinality mentioned above. But wait...
`expiration` datetime GENERATED ALWAYS AS ((`create_date` + interval `lifetime` day)) VIRTUAL,
Expiration is another virtual column. This has some repercussions.
When a secondary index is created on a generated virtual column,
generated column values are materialized in the records of the index.
If the index is a covering index (one that includes all the columns
retrieved by a query), generated column values are retrieved from
materialized values in the index structure instead of computed “on the
fly”.
Ref: https://dev.mysql.com/doc/refman/5.7/en/create-table-secondary-indexes.html#json-column-indirect-index
This is contrary to
VIRTUAL: Column values are not stored, but are evaluated when rows are
read, immediately after any BEFORE triggers. A virtual column takes no
storage.
Ref: https://dev.mysql.com/doc/refman/5.7/en/create-table-generated-columns.html
We create virtual columns based on the sound principle that values generated by simple operations on other columns shouldn't be stored, to avoid redundancy. But by creating an index on such a column, we reintroduce that redundancy.
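One way to notice this kind of drift, offered as a sketch under assumptions rather than a documented procedure: run the same aggregate twice, once normally and once with the `expiration` index ignored, so that the second query has to evaluate the virtual column from the base rows. If the two results differ, the index contents are suspect.

```sql
-- May be answered from the secondary index, where the generated
-- values are materialized.
SELECT MAX(expiration), MIN(expiration) FROM items;

-- The hint steers MySQL away from the `expiration` index, so the virtual
-- column is computed from create_date and lifetime as the rows are read.
SELECT MAX(expiration), MIN(expiration) FROM items IGNORE INDEX (expiration);
```

Running EXPLAIN on each statement confirms which index, if any, is actually used.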
Proposed fixes
Based on the information provided, you don't really seem to need the status_expired column or even the expired column. An item that's past its expiry date is expired!
CREATE TABLE `items` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`user_id` bigint(20) unsigned DEFAULT NULL,
`image` json DEFAULT NULL,
`status` json DEFAULT NULL,
`lifetime` tinyint(4) NOT NULL,
`expired_date` datetime GENERATED ALWAYS AS ((`create_date` + interval `lifetime` day)) VIRTUAL,
`last_update` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`create_date` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `user_id` (`user_id`),
KEY `expiration` (`expired_date`) USING BTREE,
CONSTRAINT `ts_competition_item_ibfk_2` FOREIGN KEY (`user_id`) REFERENCES `ts_user_core` (`user_id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=1312459 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=COMPRESSED
Simply compare the current date with the expired_date column in the above table when you need to find out which items have expired. The difference here is that, instead of expired being calculated in every query, expired_date is calculated once, when you create the record.
This makes your table a lot neater and your queries possibly faster.
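As a minimal sketch of that comparison, assuming the revised table above with its `expired_date` column and using NOW() as the reference time:

```sql
-- Items whose expiry date has already passed, most recently expired first.
SELECT id, user_id, expired_date
FROM items
WHERE expired_date < NOW()
ORDER BY expired_date DESC;
```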

Related

Django slow inner join on a table with more than 10 million records

I am using mysql with Django. I am trying to count the number of visitor_pages for a specific dealer in a certain amount of time.
I would share the raw sql query that I have obtained from django debug toolbar.
SELECT COUNT(*) AS `__count`
FROM `visitor_page`
INNER JOIN `dealer_visitors`
ON (`visitor_page`.`dealer_visitor_id` = `dealer_visitors`.`id`)
WHERE (`visitor_page`.`date_time` BETWEEN '2021-02-01 05:51:00'
AND '2021-03-21 05:50:00'
AND `dealer_visitors`.`dealer_id` = 15)
The issue is that I have more than 13 million records in the visitor_pages table and about 1.5 million records in the dealer_visitor table. I have already indexed date_time. I am thinking of using a materialized view but before attempting that, I would really appreciate suggestions on how I could improve this query.
visitor_pages schema:
CREATE TABLE `visitor_page` (
`id` int NOT NULL AUTO_INCREMENT,
`date_time` datetime(6) DEFAULT NULL,
`added_at` datetime(6) DEFAULT NULL,
`updated_at` datetime(6) DEFAULT NULL,
`page_id` int NOT NULL,
`dealer_visitor_id` int NOT NULL,
PRIMARY KEY (`id`),
KEY `visitor_page_page_id_246babdf_fk_web_page_id` (`page_id`),
KEY `visitor_page_dealer_visitor_id_e2dddea2_fk_dealer_visitors_id` (`dealer_visitor_id`),
KEY `visitor_page_date_time_06e9e9f5` (`date_time`),
CONSTRAINT `visitor_page_dealer_visitor_id_e2dddea2_fk_dealer_visitors_id` FOREIGN KEY (`dealer_visitor_id`) REFERENCES `dealer_visitors` (`id`),
CONSTRAINT `visitor_page_page_id_246babdf_fk_web_page_id` FOREIGN KEY (`page_id`) REFERENCES `web_page` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=13626649 DEFAULT CHARSET=latin1;
dealer_visitors schema:
CREATE TABLE `dealer_visitors` (
`id` int NOT NULL AUTO_INCREMENT,
`visit_date` datetime(6) DEFAULT NULL,
`added_at` datetime(6) DEFAULT NULL,
`updated_at` datetime(6) DEFAULT NULL,
`dealer_id` int NOT NULL,
`visitor_id` int NOT NULL,
`type` int DEFAULT NULL,
`notes` longtext,
`location` varchar(100) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `dealer_visitors_dealer_id_306e2202_fk_dealer_id` (`dealer_id`),
KEY `dealer_visitors_visitor_id_27ae498e_fk_visitor_id` (`visitor_id`),
KEY `dealer_visitors_type_af0f7d79` (`type`),
KEY `dealer_visitors_visit_date_f2b138c9` (`visit_date`),
CONSTRAINT `dealer_visitors_dealer_id_306e2202_fk_dealer_id` FOREIGN KEY (`dealer_id`) REFERENCES `dealer` (`id`),
CONSTRAINT `dealer_visitors_visitor_id_27ae498e_fk_visitor_id` FOREIGN KEY (`visitor_id`) REFERENCES `visitor` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1524478 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
EXPLAIN ANALYZE the query gives me the following:
EXPLAIN:
For this query:
SELECT COUNT(*) AS `__count`
FROM visitor_page vp JOIN
dealer_visitors dv
ON vp.dealer_visitor_id = dv.id
WHERE vp.date_time BETWEEN '2021-02-01 05:51:00' AND '2021-03-21 05:50:00' AND
dv.dealer_id = 15;
The best indexes are on dealer_visitors(dealer_id, id) and visitor_page(dealer_visitor_id, date_time).
An index only on date_time helps a bit. But you are retrieving more than a month's worth of data, and that might be a lot of rows to process. Starting from dealer_id restricts dealer_visitors to that dealer's rows, and having dealer_visitor_id as the first column of the visitor_page index restricts the page rows to those visitors within the time frame.
Depending on the distribution of the data, the Optimizer might pick one of the tables to start with, or pick the other. So, let's provide optimal indexes for each case:
ON `visitor_page`.`dealer_visitor_id` = `dealer_visitors`.`id`
WHERE `visitor_page`.`date_time` BETWEEN ...
AND `dealer_visitors`.`dealer_id` = 15
Starting with visitor_page:
visitor_page: INDEX(date_time) -- (already exists)
dealer_visitors: (already has PRIMARY KEY(id))
Starting with dealer_visitors:
dealer_visitors: INDEX(dealer_id) -- (already exists)
visitor_page: INDEX(dealer_visitor_id, date_time) -- in this order
and drop visitor_page_dealer_visitor_id_e2dddea2_fk_dealer_visitors_id, which becomes redundant because it is the left part of the new index.
The net is to add one index and drop one index.
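In SQL, that change might look like the following sketch. The new index name `visitor_page_dv_date` is made up for illustration. The index being dropped is the existing single-column one on `visitor_page.dealer_visitor_id`, which is a left prefix of the new composite index. Adding first and dropping second keeps an index on the foreign key column at all times.

```sql
-- Step 1: add the composite index (hypothetical name).
ALTER TABLE visitor_page
  ADD INDEX visitor_page_dv_date (dealer_visitor_id, date_time);

-- Step 2: drop the single-column index that the new one makes redundant.
ALTER TABLE visitor_page
  DROP INDEX visitor_page_dealer_visitor_id_e2dddea2_fk_dealer_visitors_id;
```

On a table with 13 million rows, both statements can take a while, so it is worth trying them on a copy first.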
Materialized view -- It is often best for Data Warehouse reports to build and incrementally maintain a "summary table" (a "materialized view"). The very odd date range (1 month + 20 days - 61 seconds) makes this clumsy to do. Typically it is handy to make the table based on whole days. If you can shift to daily (or hourly), then see http://mysql.rjweb.org/doc.php/summarytables
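A minimal sketch of such a summary table, assuming daily granularity is acceptable. The table name `dealer_page_daily` and its columns are hypothetical, and the dates are placeholders; a scheduled job would run the INSERT once for each completed day.

```sql
CREATE TABLE dealer_page_daily (
  dealer_id  INT NOT NULL,
  visit_day  DATE NOT NULL,
  page_views INT UNSIGNED NOT NULL,
  PRIMARY KEY (dealer_id, visit_day)
) ENGINE=InnoDB;

-- Roll up one completed day (here 2021-03-20) for every dealer.
INSERT INTO dealer_page_daily (dealer_id, visit_day, page_views)
SELECT dv.dealer_id, DATE(vp.date_time), COUNT(*)
FROM visitor_page AS vp
JOIN dealer_visitors AS dv ON vp.dealer_visitor_id = dv.id
WHERE vp.date_time >= '2021-03-20'
  AND vp.date_time <  '2021-03-21'
GROUP BY dv.dealer_id, DATE(vp.date_time);

-- The report then sums a few dozen summary rows instead of counting
-- millions of detail rows.
SELECT SUM(page_views) AS `__count`
FROM dealer_page_daily
WHERE dealer_id = 15
  AND visit_day BETWEEN '2021-02-01' AND '2021-03-20';
```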
Something else to check: How much RAM do you have? What does SHOW VARIABLES LIKE 'innodb_buffer_pool_size'; say?
I see that the tables have different charset/collation. This is not a problem for the query in question, but if you have other queries that JOIN on VARCHARs, check that they use the same collation.

MySQL: why can't I update only one column?

I have table:
CREATE TABLE `cold_water_volume_value` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`parameter_value_id` int(11) NOT NULL,
`time` timestamp(4) NOT NULL DEFAULT CURRENT_TIMESTAMP(4) ON UPDATE CURRENT_TIMESTAMP(4),
`value` double NOT NULL,
`device_id` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_cold_water_volume_value_id_device_time` (`parameter_value_id`,`device_id`,`time`),
KEY `idx_cold_water_volume_value_id_time` (`parameter_value_id`,`time`),
KEY `fk_cold_water_volume_value_device_id_idx` (`device_id`),
CONSTRAINT `fk_cold_water_volume_value_device_id` FOREIGN KEY (`device_id`) REFERENCES `device` (`id`) ON UPDATE SET NULL,
CONSTRAINT `fk_cold_water_volume_value_id` FOREIGN KEY (`parameter_value_id`) REFERENCES `cold_water_volume_parameter` (`id`) ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=684740 DEFAULT CHARSET=utf8;
All rows have device_id = NULL. I want to update it with this script:
UPDATE cold_water_volume_value SET device_id = 130101 WHERE parameter_value_id = 2120101;
But instead of only changing device_id from NULL to the given value for the selected parameter_value_id, it also sets the time column to NOW() and the value column to some number (seemingly picked at random from previous values).
Why does this happen, and what is the correct way to do it?
time is automatically updated as per your schema.
`time` timestamp(4) NOT NULL DEFAULT CURRENT_TIMESTAMP(4) ON UPDATE CURRENT_TIMESTAMP(4)
                                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To get around that you can set time to itself in your update.
UPDATE cold_water_volume_value
SET device_id = 130101, time = time
WHERE parameter_value_id = 2120101;
But that is likely there to track when a row was last updated. If so, it's working as intended; leave it to do its thing.
As for value, there might be an update trigger on it. Check with SHOW TRIGGERS and look for triggers on that table.
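For example, the trigger check can be narrowed to this one table. Note that the LIKE pattern in SHOW TRIGGERS matches table names, not trigger names:

```sql
-- Lists every trigger defined on the table, with its timing and event.
SHOW TRIGGERS LIKE 'cold_water_volume_value';
```

An empty result means no trigger is rewriting `value`, and the cause has to be looked for elsewhere.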
Your device_id is updated using content of time probably because in your index definition you mixed datatypes. It's worth noting that you should not mix datatypes especially on where clause when indexing.
Try to separate your indexes for example:
KEY idx_cold_water_volume_value_id_device_time (time),
KEY idx_cold_water_volume_value_id_device (parameter_value_id,device_id),
Try above statements for your definition and run query again.
It makes sense for the indexed column to have the same datatypes.
e.g. parameter_value_id and device_id

MySQL composite index effect on joins

I have the following SQL query (DB is MySQL 5):
select
event.full_session_id,
DATE(min(event.date)),
event_exe.user_id,
COUNT(DISTINCT event_pat.user_id)
FROM
event AS event
JOIN event_participant AS event_pat ON
event.pat_id = event_pat.id
JOIN event_participant AS event_exe on
event.exe_id = event_exe.id
WHERE
event_pat.user_id <> event_exe.user_id
GROUP BY
event.full_session_id;
"SHOW CREATE TABLE event":
CREATE TABLE `event` (
`id` int(12) NOT NULL AUTO_INCREMENT,
`date` datetime NOT NULL,
`session_id` varchar(64) DEFAULT NULL,
`full_session_id` varchar(72) DEFAULT NULL,
`pat_id` int(12) DEFAULT NULL,
`exe_id` int(12) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `SESSION_IDX` (`full_session_id`),
KEY `PAT_ID_IDX` (`pat_id`),
KEY `DATE_IDX` (`date`),
KEY `SESSLOGPATEXEC_IDX` (`full_session_id`,`date`,`pat_id`,`exe_id`)
) ENGINE=MyISAM AUTO_INCREMENT=371955 DEFAULT CHARSET=utf8
"SHOW CREATE TABLE event_participant":
CREATE TABLE `event_participant` (
`id` int(12) NOT NULL AUTO_INCREMENT,
`user_id` varchar(64) NOT NULL,
`alt_user_id` varchar(64) NOT NULL,
`username` varchar(128) NOT NULL,
`usertype` varchar(32) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `ALL_UNQ` (`user_id`,`alt_user_id`,`username`,`usertype`),
KEY `USER_ID_IDX` (`user_id`)
) ENGINE=MyISAM AUTO_INCREMENT=5397 DEFAULT CHARSET=utf8
Also, the query itself seems ugly, but this is legacy code on a production system, so we are not expected to change it (at least for now).
The problem is that there are around 36 million records in the event table (in the production system), so there have been frequent crashes of the DB machine due to "Using temporary; Using filesort" processing (they provided the EXPLAIN output; unfortunately, I don't have it right now. I'll try to add it to this post later.)
The customer asks for a "quick fix" by adding indices. Currently we have indices on full_session_id, pat_id, date (separately) on event and user_id on event_participant.
Thus I'm thinking of creating a composite index (pat_id, exe_id, full_session_id, date) on event. This index comprises the fields in the join (equivalent to WHERE?), then the GROUP BY, then the aggregate (MIN) parts.
This is just an idea because we currently don't have that kind of data volume to test, so we try the best we could first.
My question is:
Could the index above help performance? (The effect is quite confusing, because I have found two contrasting results: https://dba.stackexchange.com/questions/158385/compound-index-on-inner-join-table
versus "Separate Join clause in a Composite Index". The former suggests that a composite index on joins will work, and the latter that it won't.)
Does this path (adding indices) have any hope? Or should we forget it and just try to optimize the query instead?
Thanks in advance for your help :)
Update:
I have updated the full table description for the two related tables.
MySQL version is 5.1.69. But I think we don't need to worry about the ambiguous data issue mentioned in the comments, because it seems there won't be ambiguity for our data. Specifically, for each full_session_id, there is only one "event_exe.user_id" returned (it's just a business logic in the application)
So, what do you think about my 2 questions ?

Indexing needs to be sped up

I have a table with the following details:
CREATE TABLE `test` (
`seenDate` datetime NOT NULL DEFAULT '0001-01-01 00:00:00',
`corrected_test` varchar(45) DEFAULT NULL,
`corrected_timestamp` timestamp NULL DEFAULT NULL,
`unable_to_correct` tinyint(1) DEFAULT '0',
`fk_zone_for_correction` int(11) DEFAULT NULL,
PRIMARY KEY (`sightinguid`),
KEY `corrected_test` (`corrected_test`),
KEY `idx_seenDate` (`seenDate`),
KEY `idx_corrected_test_seenDate` (`corrected_test`,`seenDate`),
KEY `zone_for_correction_fk_idx` (`fk_zone_for_correction`),
KEY `idx_corrected_test_zone` (`fk_zone_for_correction`,`corrected_test`,`seenDate`),
CONSTRAINT `zone_for_correction_fk` FOREIGN KEY (`fk_zone_for_correction`) REFERENCES `zone_test` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
I am then using the following query:
SELECT
*
FROM
test
WHERE
fk_zone_for_correction = 1
AND (unable_to_correct = 0
OR unable_to_correct IS NULL)
AND (corrected_test = ''
OR corrected_test IS NULL)
AND (last_accessed_timestamp IS NULL
OR last_accessed_timestamp < (NOW() - INTERVAL 30 MINUTE))
ORDER BY seenDate ASC
LIMIT 1
Here is a screenshot of the optimiser - the ORDER BY is slowing things down, and in my opinion seems to be indexed properly, and the correct index (idx_corrected_test_zone) is being selected. What can be done to improve it?
There is no INDEX that will help much.
This might help:
INDEX(fk_zone_for_correction, seenDate)
Both columns can perhaps be used -- the first for filtering, the second for avoiding having to sort. But, it could backfire if it can't find the 1 row quickly.
The killer is OR. If you could avoid ever populating any of those 3 columns with NULL, then this might be better:
INDEX(fk_zone_for_correction, unable_to_correct, corrected_test, last_accessed_timestamp)
-- the range thing needs to be last
-- this index would do the filtering, but fail to help with `ORDER` and `LIMIT`.
Even though it is using idx_corrected_test_zone, it is probably not using more than the first two columns -- because of OR.
You have two cases of redundant indexes. For example, the first of these is the left part of the second; so the first is redundant and can be DROPped:
KEY `corrected_test` (`corrected_test`),
KEY `idx_corrected_test_seenDate` (`corrected_test`,`seenDate`),
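Put together, the index changes discussed above might look like the sketch below. The new index name `idx_zone_seenDate` is made up for illustration. Treating `zone_for_correction_fk_idx` as the second redundant index is my reading of the schema (it is the left part of `idx_corrected_test_zone`, which can also serve the foreign key), so verify that before dropping it.

```sql
-- Candidate index: filter on the zone, then read rows in seenDate order.
ALTER TABLE test
  ADD INDEX idx_zone_seenDate (fk_zone_for_correction, seenDate);

-- Drop indexes that are left prefixes of longer existing indexes.
ALTER TABLE test
  DROP INDEX corrected_test,
  DROP INDEX zone_for_correction_fk_idx;
```

Re-run EXPLAIN on the query afterwards to see whether the optimizer actually picks the new index.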

Can we have primary key as function of column or set of column

I am trying to define a table whose primary key is the reverse of a column. I was wondering if that's possible in InnoDB?
I am creating a table
create table abc
(
`id` varchar(255) PRIMARY KEY ,
`key` LONGTEXT NOT NULL,
`value` LONGTEXT NOT NULL ,
`last_modified` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`created_on` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00'
)
but instead I want PRIMARY KEY (reverse(id))
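Older MySQL versions don't support indexes on functions at all, and even where functional key parts exist (MySQL 8.0.13 and later) they cannot be part of a primary key. So you can't use a function for a primary key.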
MySQL also supports neither materialized views nor indexes on views.
Depending on what you're trying to accomplish, most likely you should just store your key the other way around. If your application depends on having it reverse of what's stored, create a view for the application to interact with. Unfortunately, updates will be difficult as MySQL doesn't do INSTEAD OF triggers or triggers on views at all.
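A minimal sketch of the "store it reversed, expose a view" idea. The names `abc_store` and `id_reversed` are hypothetical, and the timestamp columns from the question are left out for brevity.

```sql
-- Base table: the primary key holds the already-reversed id.
CREATE TABLE abc_store (
  `id_reversed` varchar(255) PRIMARY KEY,
  `key` LONGTEXT NOT NULL,
  `value` LONGTEXT NOT NULL
);

-- View for the application: presents the id in its natural order.
CREATE VIEW abc AS
SELECT REVERSE(`id_reversed`) AS `id`, `key`, `value`
FROM abc_store;

-- Writes go to the base table, reversing the id on the way in.
INSERT INTO abc_store (`id_reversed`, `key`, `value`)
VALUES (REVERSE('user-123'), 'k', 'v');

-- Point lookups that should use the primary key reverse the search value
-- rather than the column.
SELECT `key`, `value`
FROM abc_store
WHERE `id_reversed` = REVERSE('user-123');
```

Filtering on `id` through the view still works, but it applies REVERSE() to every row and so cannot use the primary key.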