MySQL GROUP BY not using index - mysql

Given:
CREATE TABLE `APPLICATION_DEVICE_PUSHINFO` (
`applicationId` bigint(20) NOT NULL,
`deviceId` bigint(20) NOT NULL,
`active` bit(1) NOT NULL,
`inactiveAsOf` datetime DEFAULT NULL,
`lastSentOn` datetime DEFAULT NULL,
`registeredOn` datetime DEFAULT NULL,
`target` int(11) DEFAULT NULL,
`token` varchar(4096) NOT NULL,
PRIMARY KEY (`applicationId`,`deviceId`),
KEY `FKE7F2D58285EFFEAA_idx` (`deviceId`),
KEY `index3` (`token`(255)) USING BTREE,
CONSTRAINT `FKE7F2D58285EFFEAA` FOREIGN KEY (`deviceId`) REFERENCES `DEVICES` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
If I execute the following query:
EXPLAIN SELECT token FROM APPLICATION_DEVICE_PUSHINFO GROUP BY token HAVING COUNT(deviceId) > 1;
I get:
'1', 'SIMPLE', 'APPLICATION_DEVICE_PUSHINFO', 'ALL', NULL, NULL, NULL, NULL, '7', 'Using temporary; Using filesort'
The NULL values belong to possible_keys and so on.
Why is the index on the token column not used?

As you don't have a WHERE clause, the query needs to process all rows (note that the HAVING clause is applied after the GROUP BY, hence it doesn't limit the rows to be processed, just those that are returned).
If you need to touch all rows anyway, it's hard to gain any benefit from an index. Nevertheless, it is possible to gain something if you're able to do an index-only scan (IOS) and/or benefit from the pre-ordered data on disk.
However, an IOS might be prevented (I'm not sure whether MySQL considers the NOT NULL constraint here) by the fact that you access the deviceId column, which is not included in the index that could possibly be used for this query (index3). Note that you need ONE index that covers all needs of the query to get an index-only scan. If MySQL is smart enough to recognize the NOT NULL constraint, this shouldn't be an issue; otherwise, rewrite your query, e.g. to use COUNT(*) > 1.
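For example, a minimal rewrite (assuming you only care about duplicate tokens) that avoids reading deviceId at all, so index3 is the only structure the query needs:
EXPLAIN SELECT token
FROM APPLICATION_DEVICE_PUSHINFO
GROUP BY token
HAVING COUNT(*) > 1;
One caveat: index3 is a 255-character prefix index, and a prefix index cannot act as a covering index, so MySQL may still decide to scan the table.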
In this particular case, your chances of getting an IOS are bad anyway because of the small size of the table (at least according to the optimizer's estimates), as already mentioned by Strawberry.
If you need to make sure it works with more rows as well, just fill up the table and see if the execution plan changes. If not, change the query as mentioned above and try again. If it still doesn't, return here and we'll see (post the new execution plan).
Your desire to execute this query via an index is in principle reasonable. Making it work is another story :(

Related

MySQL composite index effect on joins

I have the following SQL query (DB is MySQL 5):
select
event.full_session_id,
DATE(min(event.date)),
event_exe.user_id,
COUNT(DISTINCT event_pat.user_id)
FROM
event AS event
JOIN event_participant AS event_pat ON
event.pat_id = event_pat.id
JOIN event_participant AS event_exe on
event.exe_id = event_exe.id
WHERE
event_pat.user_id <> event_exe.user_id
GROUP BY
event.full_session_id;
"SHOW CREATE TABLE event":
CREATE TABLE `event` (
`id` int(12) NOT NULL AUTO_INCREMENT,
`date` datetime NOT NULL,
`session_id` varchar(64) DEFAULT NULL,
`full_session_id` varchar(72) DEFAULT NULL,
`pat_id` int(12) DEFAULT NULL,
`exe_id` int(12) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `SESSION_IDX` (`full_session_id`),
KEY `PAT_ID_IDX` (`pat_id`),
KEY `DATE_IDX` (`date`),
KEY `SESSLOGPATEXEC_IDX` (`full_session_id`,`date`,`pat_id`,`exe_id`)
) ENGINE=MyISAM AUTO_INCREMENT=371955 DEFAULT CHARSET=utf8
"SHOW CREATE TABLE event_participant":
CREATE TABLE `event_participant` (
`id` int(12) NOT NULL AUTO_INCREMENT,
`user_id` varchar(64) NOT NULL,
`alt_user_id` varchar(64) NOT NULL,
`username` varchar(128) NOT NULL,
`usertype` varchar(32) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `ALL_UNQ` (`user_id`,`alt_user_id`,`username`,`usertype`),
KEY `USER_ID_IDX` (`user_id`)
) ENGINE=MyISAM AUTO_INCREMENT=5397 DEFAULT CHARSET=utf8
Also, the query itself seems ugly, but this is legacy code on a production system, so we are not expected to change it (at least for now).
The problem is that there are around 36 million records in the event table (in the production system), so there have been frequent crashes of the DB machine due to "Using temporary; Using filesort" processing (they provided these EXPLAIN outputs; unfortunately, I don't have them right now, and I'll try to add them to this post later).
The customer asks for a "quick fix" by adding indices. Currently we have indices on full_session_id, pat_id, date (separately) on event and user_id on event_participant.
Thus I'm thinking of creating a composite index (pat_id, exe_id, full_session_id, date) on event, as sketched below; this index comprises the fields in the join (equivalent to a WHERE?), then the GROUP BY, then the aggregate (MIN) parts.
This is just an idea, because we currently don't have that kind of data volume to test with, so we try the best we can first.
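A sketch of the proposed index (the index name is made up):
ALTER TABLE event
  ADD INDEX PAT_EXE_SESS_DATE_IDX (pat_id, exe_id, full_session_id, date);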
My question is:
Could the index above help performance? It's quite confusing, because I have found two really contrasting results: https://dba.stackexchange.com/questions/158385/compound-index-on-inner-join-table versus Separate Join clause in a Composite Index, where the latter suggests that a composite index on joins won't work and the former that it will.
Does this path (adding indices) have hope? Or should we forget it and just try to optimize the query instead?
Thanks in advance for your help :)
Update:
I have updated the full table description for the two related tables.
MySQL version is 5.1.69. But I think we don't need to worry about the ambiguous-data issue mentioned in the comments, because it seems there won't be ambiguity in our data. Specifically, for each full_session_id, there is only one event_exe.user_id returned (it's just business logic in the application).
So, what do you think about my two questions?

Why does MySQL still use an index when the query uses only the 2nd column of a multi-column index?

Why does MySQL still use an index to get data when the query uses only the 2nd column of a multi-column index?
We know MySQL uses the leftmost-prefix match rule, but here I didn't use the 1st column, only the 2nd. The two SELECT results below show that MySQL sometimes uses an index and sometimes doesn't. Why? In addition, my MySQL version is 5.6.17.
1. Create table:
CREATE TABLE `student` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
`cid` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `name_cid_INX` (`name`,`cid`)
) ENGINE=InnoDB AUTO_INCREMENT=101 DEFAULT CHARSET=utf8
2. Run select:
EXPLAIN SELECT * FROM student WHERE cid=1;
3. Result:
[Screenshot: EXPLAIN output showing an index being used]
It shows that MySQL uses an index to get the data.
The following is another table.
1. Create table:
CREATE TABLE `test_table` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(45) DEFAULT NULL,
`birthday` datetime DEFAULT NULL,
`address` varchar(45) DEFAULT NULL,
`phone` varchar(45) DEFAULT NULL,
`note` varchar(45) DEFAULT NULL,
`age` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `NAME` (`name`),
KEY `AGE` (`age`),
KEY `LeftMostPreFix` (`name`,`address`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
2. Run select:
EXPLAIN SELECT * FROM test.test_table WHERE address = '东京';
3. Result:
[Screenshot: EXPLAIN output showing no index being used]
On the contrary, here it shows that MySQL didn't use an index to get the data.
Comparing the two results above, I am puzzled why the 1st result uses an index, which seems to go against the leftmost-prefix match rule.
From the MySQL manual:
it is possible that key will name an index that is not present in the possible_keys value. This can happen if none of the possible_keys indexes are suitable for looking up rows, but all the columns selected by the query are columns of some other index. That is, the named index covers the selected columns, so although it is not used to determine which rows to retrieve, an index scan is more efficient than a data row scan.
So while a key is named here, it's not actually used in the normal sense of looking up rows. In some situations it is still more efficient to scan that index instead of the table (as in your first example); in others it is not (as in your second).
Most of the time these things are decided by the optimizer based on several factors (usage of the table, etc.).
The best thing to remember is that here you can NOT "use the index" for row lookups, and that's why there is no index in possible_keys. You can only use the index that way if its first column is in the WHERE clause.
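For contrast, a query that includes the leftmost column can use the index for lookups (the literal value 'Alice' is made up):
EXPLAIN SELECT * FROM student WHERE name = 'Alice' AND cid = 1;
Here name_cid_INX would show up in possible_keys, because the index columns are usable left to right.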
Neither index in either case starts with what is in the WHERE clause, so there will be a full scan of the table or of the index.
Case 1: The index is "covering" (name, cid, plus the implicitly appended primary key id are all the columns the query needs), so it is a tossup as to which is better (table scan vs. index scan). The optimizer happened to pick the secondary index. EXPLAIN FORMAT=JSON SELECT ... may have enough details to explain why in this case.
Case 2: Because of the * (in SELECT *), the secondary index is at a disadvantage: it is not "covering", so the processing would bounce back and forth between the index and the data. So it is clearly better to simply scan the table.
Instead of trying to understand EXPLAIN (in these cases), turn the question around: "What is the optimal index for this query against this table?" Then follow the guidelines here.
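For the second query, for instance, the optimal index starts with the filtered column, turning the full scan into a range lookup (a sketch; the index name is made up):
ALTER TABLE test_table ADD INDEX address_idx (address);
With that index in place, address_idx should appear in possible_keys and be used to locate the rows.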

MySQL SELECT returns wrong results

I'm working with MySQL 5.7. I created a table with a virtual column (not stored) of type DATETIME, with an index on it. While I was working on it, I noticed that ORDER BY was not returning all the data (some data I was expecting at the top was missing). Also, the results from MAX and MIN were wrong.
After I ran
ANALYZE TABLE
CHECK TABLE
OPTIMIZE TABLE
the results were correct. I guess there was an issue with the index data, so I have a few questions:
When and why could this happen?
Is there a way to prevent this?
Among the three commands I ran, which is the correct one to use?
I'm worried that this could happen again in the future and I won't notice.
EDIT:
As requested in the comments, I added the table definition:
CREATE TABLE `items` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`user_id` bigint(20) unsigned DEFAULT NULL,
`image` json DEFAULT NULL,
`status` json DEFAULT NULL,
`status_expired` tinyint(1) GENERATED ALWAYS AS (ifnull(json_contains(`status`,'true','$.expired'),false)) VIRTUAL COMMENT 'used for index: it checks if status contains expired=true',
`lifetime` tinyint(4) NOT NULL,
`expiration` datetime GENERATED ALWAYS AS ((`create_date` + interval `lifetime` day)) VIRTUAL,
`last_update` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`create_date` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `user_id` (`user_id`),
KEY `expiration` (`status_expired`,`expiration`) USING BTREE,
CONSTRAINT `ts_competition_item_ibfk_2` FOREIGN KEY (`user_id`) REFERENCES `ts_user_core` (`user_id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=1312459 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=COMPRESSED
Queries that were returning the wrong results:
SELECT * FROM items ORDER BY expiration DESC;
SELECT max(expiration),min(expiration) FROM items;
Thanks
TL;DR
The trouble is that your data comes from virtual columns materialized via indexes. The CHECK, OPTIMIZE, and ANALYZE operations you ran force the indexes to be synced, which fixes any errors and gives you correct results henceforth. At least until the index gets out of sync again.
Why it may happen
Many of the problems are caused by issues with your table design. Let's start with:
`status_expired` tinyint(1) GENERATED ALWAYS AS (ifnull(json_contains(`status`,'true','$.expired'),false)) VIRTUAL
No doubt this was created to overcome the fact that you cannot directly index a JSON column in MySQL: you created a virtual column and indexed that instead. That's all very well, but this column can hold only one of two values, true or false, which means it has very poor cardinality. As a result, MySQL is unlikely to use this index on its own.
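You can check this yourself; the Cardinality column of SHOW INDEX reports the optimizer's estimate of the number of distinct values per index:
SHOW INDEX FROM items;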
But we can see that you have combined the status_expired column with the expiration column when creating the index, perhaps with the idea of overcoming the poor cardinality mentioned above. But wait...
`expiration` datetime GENERATED ALWAYS AS ((`create_date` + interval `lifetime` day)) VIRTUAL,
Expiration is another virtual column. This has some repercussions.
When a secondary index is created on a generated virtual column, generated column values are materialized in the records of the index. If the index is a covering index (one that includes all the columns retrieved by a query), generated column values are retrieved from materialized values in the index structure instead of computed "on the fly".
Ref: https://dev.mysql.com/doc/refman/5.7/en/create-table-secondary-indexes.html#json-column-indirect-index
This is contrary to
VIRTUAL: Column values are not stored, but are evaluated when rows are read, immediately after any BEFORE triggers. A virtual column takes no storage.
Ref: https://dev.mysql.com/doc/refman/5.7/en/create-table-generated-columns.html
We create virtual columns based on the sound principle that values generated by simple operations on other columns shouldn't be stored, to avoid redundancy; but by creating an index on such a column, we reintroduce redundancy.
Proposed fixes
Based on the information provided, you don't really seem to need the status_expired column, or even an expired flag at all: an item that's past its expiry date is expired!
CREATE TABLE `items` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`user_id` bigint(20) unsigned DEFAULT NULL,
`image` json DEFAULT NULL,
`status` json DEFAULT NULL,
`lifetime` tinyint(4) NOT NULL,
`expire_date` datetime GENERATED ALWAYS AS ((`create_date` + interval `lifetime` day)) VIRTUAL,
`last_update` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`create_date` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `user_id` (`user_id`),
KEY `expiration` (`expire_date`) USING BTREE,
CONSTRAINT `ts_competition_item_ibfk_2` FOREIGN KEY (`user_id`) REFERENCES `ts_user_core` (`user_id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=1312459 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=COMPRESSED
Simply compare the current date with the expire_date column in the above table when you need to find out which items have expired. The difference is that instead of expired being a value calculated in every query, expire_date is materialized in its index once, when the record is written.
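A minimal example of the lookup this enables (a sketch):
SELECT id FROM items WHERE expire_date < NOW();
The index on (expire_date) can serve this range predicate directly.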
This makes your table a lot neater and your queries possibly faster.

How to avoid full table scan

I have a MySQL database around 50GB in size, with millions of rows. Here is my table structure:
CREATE TABLE `logs` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`mac` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`firstTime` datetime DEFAULT NULL,
`lastTime` datetime DEFAULT NULL,
`locid` int(11) DEFAULT NULL,
`client_id` int(11) DEFAULT NULL,
`created_at` datetime NOT NULL,
`updated_at` datetime NOT NULL,
`isOut` tinyint(1) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_logs_on_location_id` (`locid`),
KEY `index_logs_on_client_id` (`client_id`),
KEY `macID` (`mac`)
) ENGINE=InnoDB AUTO_INCREMENT=39537721 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
I was looking for ways to avoid full table scans. I tried to add an index for the mac column. However, when I run EXPLAIN on my queries, possible_keys and key are always NULL when I don't use client_id in the WHERE clause; otherwise, the only index used is client_id or location_id, which doesn't have a significant effect on execution time. I mainly use these types of queries (grouping, sorting, etc.):
SELECT mac,COUNT(mac),DATE(lastTime)
FROM logs
WHERE client_id = 1
GROUP BY mac,DATE(lastTime)
When you consider this type of table structure, how can I optimize my table to execute queries faster? I'm open to all suggestions. Thank you
Whether MySQL (or Oracle, SQL Server, Postgres, MariaDB, DB2 and others) uses an index depends on how unique the data in the mac column is and on how that uniqueness is distributed. The database engines mentioned use a cost-based optimizer, which estimates the cost of each candidate plan and executes the one with the lowest cost. Sometimes these estimates are incorrect. They can be influenced by playing with database parameters, but that can have unexpected side effects on other queries.
The second way to influence the result is to change the data structure.
The third and most feasible way is to influence the execution plan by providing a hint. For this, let's assume an index is present on mac and lastTime, so that the DB engine only needs to read this index to do its job:
CREATE INDEX idx_mac_nn_1 ON logs(mac,lastTime);
The query to be optimized is then (your version, without the client_id column):
SELECT mac,COUNT(mac),DATE(lastTime)
FROM logs FORCE INDEX (idx_mac_nn_1)
GROUP BY mac,DATE(lastTime);
This then should force MySQL to use the index no matter what.
For this query:
SELECT mac, COUNT(mac), DATE(lastTime)
FROM logs
WHERE client_id = 1
GROUP BY mac, DATE(lastTime)
You want an index on (client_id, mac, lastTime). I would suggest a covering index, if you don't mind the extra space required.
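A sketch of that covering index (the name is made up); with it, client_id, mac, and lastTime can all be read from the index, so the query never has to touch the table rows:
ALTER TABLE logs ADD INDEX idx_client_mac_lasttime (client_id, mac, lastTime);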

GROUP BY query - why so slow?

I am trying to run a GROUP BY query on a large table (more than 8 million rows). However, I can reduce the need to group all the data by date: I have a view that captures the dates I require, and while this limits the query, it's not much better.
Finally, I need to join to another table to pick up a field.
I am showing the query, the CREATE for the main table, and the query EXPLAIN below.
Main Query:
SELECT pgi_raw_data.wsp_channel,
'IOM' AS wsp,
pgi_raw_data.dated,
pgi_accounts.`master`,
pgi_raw_data.event_id,
pgi_raw_data.breed,
Sum(pgi_raw_data.handle),
Sum(pgi_raw_data.payout),
Sum(pgi_raw_data.rebate),
Sum(pgi_raw_data.profit)
FROM pgi_raw_data
INNER JOIN summary_max
ON pgi_raw_data.wsp_channel = summary_max.wsp_channel
AND pgi_raw_data.dated > summary_max.race_date
INNER JOIN pgi_accounts
ON pgi_raw_data.account = pgi_accounts.account
GROUP BY pgi_raw_data.event_id
ORDER BY NULL
The create table:
CREATE TABLE `pgi_raw_data` (
`event_id` char(25) NOT NULL DEFAULT '',
`wsp_channel` varchar(5) NOT NULL,
`dated` date NOT NULL,
`time` time DEFAULT NULL,
`program` varchar(5) NOT NULL,
`track` varchar(25) NOT NULL,
`raceno` tinyint(2) NOT NULL,
`detail` varchar(30) DEFAULT NULL,
`ticket` varchar(20) NOT NULL DEFAULT '',
`breed` varchar(12) NOT NULL,
`pool` varchar(10) NOT NULL,
`gross` decimal(11,2) NOT NULL,
`refunds` decimal(11,2) NOT NULL,
`handle` decimal(11,2) NOT NULL,
`payout` decimal(11,4) NOT NULL,
`rebate` decimal(11,4) NOT NULL,
`profit` decimal(11,4) NOT NULL,
`account` mediumint(10) NOT NULL,
PRIMARY KEY (`event_id`,`ticket`),
KEY `idx_account` (`account`),
KEY `idx_wspchannel` (`wsp_channel`,`dated`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=latin1
This is my view for summary_max:
CREATE ALGORITHM=UNDEFINED DEFINER=`root`@`localhost` SQL SECURITY DEFINER VIEW
`summary_max` AS select `pgi_summary_tbl`.`wsp_channel` AS
`wsp_channel`,max(`pgi_summary_tbl`.`race_date`) AS `race_date`
from `pgi_summary_tbl` group by `pgi_summary_tbl`.`wsp_channel`
And the EXPLAIN output:
1  PRIMARY  <derived2>       ALL                                                          6       Using temporary
1  PRIMARY  pgi_raw_data     ref  idx_account,idx_wspchannel  idx_wspchannel  7  summary_max.wsp_channel  470690  Using where
1  PRIMARY  pgi_accounts     ref  PRIMARY  PRIMARY  3  gf3data_momutech.pgi_raw_data.account  29  Using index
2  DERIVED  pgi_summary_tbl  ALL                                                          42282   Using temporary; Using filesort
Any help on indexing would be appreciated.
At a minimum you need indexes on these fields (see the sketch after this list):
pgi_raw_data.wsp_channel,
pgi_raw_data.dated,
pgi_raw_data.account
pgi_raw_data.event_id,
summary_max.wsp_channel,
summary_max.race_date,
pgi_accounts.account
The general (not always) rule is anything you are sorting, grouping, filtering or joining on should have an index.
Also: pgi_summary_tbl.wsp_channel
Also, why the ORDER BY NULL?
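As a concrete sketch: most of these already exist (the primary keys, idx_account, and idx_wspchannel), so the one clearly missing from the DDL shown is on the view's base table (the index name is made up):
CREATE INDEX idx_summary_wsp_date ON pgi_summary_tbl (wsp_channel, race_date);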
The first thing is to be sure that you have indexes on pgi_summary_tbl(wsp_channel, race_date) and pgi_accounts(account). For this query, you don't need indexes on these columns in the raw data.
MySQL has a tendency to use indexes even when they are not the most efficient path. I would start by looking at the performance of the "full" query, without the joins:
SELECT pgi_raw_data.wsp_channel,
'IOM' AS wsp,
pgi_raw_data.dated,
-- pgi_accounts.`master`,
pgi_raw_data.event_id,
pgi_raw_data.breed,
Sum(pgi_raw_data.handle),
Sum(pgi_raw_data.payout),
Sum(pgi_raw_data.rebate),
Sum(pgi_raw_data.profit)
FROM pgi_raw_data
GROUP BY pgi_raw_data.event_id
If this has better performance, you may have a situation where the indexes are working against you. The specific problem is called "thrashing". It occurs when a table is too big to fit into memory. Often, the fastest way to deal with such a table is to just read the whole thing; accessing the table through an index can result in an extra I/O operation for most of the rows.
If this works, then do the joins after the aggregate, as sketched below. Also, consider getting more memory, so the whole table will fit into memory.
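A sketch of the "aggregate first, join later" rewrite. It leans on the same assumption the original query already makes, namely that wsp_channel, dated, breed, and account are functionally dependent on event_id (under strict ONLY_FULL_GROUP_BY you would need to add them to the GROUP BY or wrap them in ANY_VALUE(), available from MySQL 5.7 on):
SELECT agg.wsp_channel,
       'IOM' AS wsp,
       agg.dated,
       pgi_accounts.`master`,
       agg.event_id,
       agg.breed,
       agg.handle, agg.payout, agg.rebate, agg.profit
FROM (SELECT event_id, wsp_channel, dated, breed, account,
             SUM(handle) AS handle, SUM(payout) AS payout,
             SUM(rebate) AS rebate, SUM(profit) AS profit
      FROM pgi_raw_data
      GROUP BY event_id) AS agg
INNER JOIN summary_max
        ON agg.wsp_channel = summary_max.wsp_channel
       AND agg.dated > summary_max.race_date
INNER JOIN pgi_accounts
        ON agg.account = pgi_accounts.account;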
Second, if you have to deal with this type of data, then partitioning the table by date may prove to be a very useful option. This will allow you to significantly reduce the overhead of reading the large table. You do have to be sure that the summary table can be read the same way.
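A sketch of date partitioning (partition names and boundaries are made up). Note that MySQL requires the partitioning column to be part of every unique key, so the primary key would have to be extended first:
ALTER TABLE pgi_raw_data DROP PRIMARY KEY,
                         ADD PRIMARY KEY (event_id, ticket, dated);
ALTER TABLE pgi_raw_data
  PARTITION BY RANGE (TO_DAYS(dated)) (
    PARTITION p2013 VALUES LESS THAN (TO_DAYS('2014-01-01')),
    PARTITION p2014 VALUES LESS THAN (TO_DAYS('2015-01-01')),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
  );
With this in place, queries filtering on dated only read the relevant partitions (partition pruning).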