MySQL composite index effect on joins - mysql

I have the following SQL query (DB is MySQL 5):
select
event.full_session_id,
DATE(min(event.date)),
event_exe.user_id,
COUNT(DISTINCT event_pat.user_id)
FROM
event AS event
JOIN event_participant AS event_pat ON
event.pat_id = event_pat.id
JOIN event_participant AS event_exe on
event.exe_id = event_exe.id
WHERE
event_pat.user_id <> event_exe.user_id
GROUP BY
event.full_session_id;
"SHOW CREATE TABLE event":
CREATE TABLE `event` (
`id` int(12) NOT NULL AUTO_INCREMENT,
`date` datetime NOT NULL,
`session_id` varchar(64) DEFAULT NULL,
`full_session_id` varchar(72) DEFAULT NULL,
`pat_id` int(12) DEFAULT NULL,
`exe_id` int(12) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `SESSION_IDX` (`full_session_id`),
KEY `PAT_ID_IDX` (`pat_id`),
KEY `DATE_IDX` (`date`),
KEY `SESSLOGPATEXEC_IDX` (`full_session_id`,`date`,`pat_id`,`exe_id`)
) ENGINE=MyISAM AUTO_INCREMENT=371955 DEFAULT CHARSET=utf8
"SHOW CREATE TABLE event_participant":
CREATE TABLE `event_participant` (
`id` int(12) NOT NULL AUTO_INCREMENT,
`user_id` varchar(64) NOT NULL,
`alt_user_id` varchar(64) NOT NULL,
`username` varchar(128) NOT NULL,
`usertype` varchar(32) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `ALL_UNQ` (`user_id`,`alt_user_id`,`username`,`usertype`),
KEY `USER_ID_IDX` (`user_id`)
) ENGINE=MyISAM AUTO_INCREMENT=5397 DEFAULT CHARSET=utf8
Also, the query itself seems ugly, but this is legacy code on a production system, so we are not expected to change it (at least for now).
The problem is that, there is around 36 million record on the event table (in the production system), so there have been frequent crashes of the DB machine due to using temporary;using filesort processing (they provided these EXPLAIN outputs, unfortunately, I don't have them right now. I'll try to update them to this post later.)
The customer asks for a "quick fix" by adding indices. Currently we have indices on full_session_id, pat_id, date (separately) on event and user_id on event_participant.
Thus I'm thinking of creating a composite index (pat_id, exe_id, full_session_id, date) on event- this index comprises of the fields in the join (equivalent to where ?), then group by, then aggregate (min) parts.
This is just an idea because we currently don't have that kind of data volume to test, so we try the best we could first.
My question is:
Could the index above help in the performance ? (It's quite confusing on the effect because I have found two really contrasting results: https://dba.stackexchange.com/questions/158385/compound-index-on-inner-join-table
versus Separate Join clause in a Composite Index, where the latter suggests that composite index on joins won't work and the former that it'll work.
Does this path (adding indices) have hopes ? Or should we forget it and just try to optimize the query instead ?
Thanks in advance for your help :)
Update:
I have updated the full table description for the two related tables.
MySQL version is 5.1.69. But I think we don't need to worry about the ambiguous data issue mentioned in the comments, because it seems there won't be ambiguity for our data. Specifically, for each full_session_id, there is only one "event_exe.user_id" returned (it's just a business logic in the application)
So, what do you think about my 2 questions ?

Related

why mysql still use index to get data when use the 2nd col of multiple column index in mysql?

Why mysql still use index to get data when use the 2nd col of multiple column index in mysql?
We know mysql use leftmost match rule, but here I didn't use the 1st col and I use the 2nd col, the two select operation results bellow show that mysql sometimes use index and sometimes didn't use it. Why? In addtion, my mysql version is 5.6.17.
1.create table:
CREATE TABLE `student` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
`cid` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `name_cid_INX` (`name`,`cid`)
) ENGINE=InnoDB AUTO_INCREMENT=101 DEFAULT CHARSET=utf8
2.run select:
EXPLAIN SELECT * FROM student WHERE cid=1;
3. result:
Result with index
It shows that mysql use index to get data.
The following is another table.
1.create table:
CREATE TABLE `test_table` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(45) DEFAULT NULL,
`birthday` datetime DEFAULT NULL,
`address` varchar(45) DEFAULT NULL,
`phone` varchar(45) DEFAULT NULL,
`note` varchar(45) DEFAULT NULL,
`age` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `NAME` (`name`),
KEY `AGE` (`age`),
KEY `LeftMostPreFix` (`name`,`address`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
2.run select:
explain SELECT * FROM test.test_table where address = '东京'
3.result:
Result without index
On the contrary here it shows that mysql didn't use index to get data.
Comparing above two results, I feel puzzled why the 1st result use index which is against leftmost match rule.
From the mysql manual
it is possible that key will name an index that is not present in the possible_keys value. This can happen if none of the possible_keys indexes are suitable for looking up rows, but all the columns selected by the query are columns of some other index. That is, the named index covers the selected columns, so although it is not used to determine which rows to retrieve, an index scan is more efficient than a data row scan.
So while there is a key used here, it's not actually used in the normal sense. In some situations it is still more efficient to use that as a table scan (in your first example), in others it might not be (in your second)
Most of the times these things are decided by the optimizer based on several things (usage of the table, etc).
Best thing to remember is that here you can NOT "use the index", and that's why there is no index in possible keys. You can only use the index if the first column is in there.
Neither index in either Case starts with what is in the WHERE, so there will be a full scan of table or of index.
Case 1: The index is "covering", so it is a tossup as to which (table scan vs index scan) is better. The Optimizer happened to pick the secondary index. EXPLAIN FORMAT=JSON SELECT ... may have enough details to explain 'why' in this case.
Case 2: Because of * (in SELECT *), the secondary index is at a disadvantage -- it is not "covering", so the processing will bounce back and forth between the index and the data. So it is clearly better to simply scan the table.
Instead of trying to understand EXPLAIN (in these cases), turn the question around: "What is the optimal index for this query against this table?" Then follow the guidelines here.

Should I be using multiple single-column indexes or a single multi-column index?

This is a pretty basic question, but I'm confused by what I'm reading in various places. I have a simple table that doesn't contain a huge amount of data (less than 500 rows for any given db is typical) A typical query against this table looks like :
select system_fields.name from system_fields where system_fields.form_id=? and system_fields.field_id=?
My question is, should I have a separate index for form_id and one for field_id, or should I be creating an index on a combination of those two fields? I've never really done anything with multi-column indexes before.
CREATE TABLE IF NOT EXISTS `system_fields` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`field_id` int(11) NOT NULL,
`form_id` int(11) NOT NULL,
`name` varchar(50) NOT NULL,
`reference_field_id` varchar(1000) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `field_id` (`field_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=293 ;
If you are always going to query by these two fields, then add a multi-column index.
I'll also point out that if you're going to have < 500 rows in the table, your index may not even get used. Any performance difference with or without an index on a 500-row table will be negligible.
Here's a bit more (good) reading:
https://www.percona.com/blog/2014/01/03/multiple-column-index-vs-multiple-indexes-with-mysql-56/

How to avoid full table scan

I have a MYSQL database around 50GB size with millions of rows. Here is my table structure
CREATE TABLE `logs` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`mac` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`firstTime` datetime DEFAULT NULL,
`lastTime` datetime DEFAULT NULL,
`locid` int(11) DEFAULT NULL,
`client_id` int(11) DEFAULT NULL,
`created_at` datetime NOT NULL,
`updated_at` datetime NOT NULL,
`isOut` tinyint(1) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_logs_on_location_id` (`location_id`),
KEY `index_logs_on_client_id` (`client_id`),
KEY `macID` (`macID`)
) ENGINE=InnoDB AUTO_INCREMENT=39537721 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
I was looking ways to avoid full table scans. I tried to add index for mac column. However when I run EXPLAIN on my queries, possible_keys and keys are always NULL when I don't use client_id in WHERE clause, otherwise my only used index is client_id or location_id which doesn't have a significant effect on my queries in the sense of execution time. I mainly use these types of queries(grouping,sorting etc..)
SELECT mac,COUNT(mac),DATE(lastTime)
FROM logs
WHERE client_id = 1
GROUP BY mac,DATE(lastTime)
When you consider this type of table structure, how can I optimize my table to execute queries faster? I'm open to all suggestions. Thank you
To get MySQL (or Oracle, SQL Server, Postgres, MariaDB, DB2 and others) to use an index depends on how unique is the data in the mac column and how the distribution of the uniqueness is. The database engines mentioned use a cost based optimizer which estimates the cost of a certain solution and execute the solution with the lowest cost. Sometimes they are incorrect. This estimate can be influenced by playing with database parameters, however this can have unexpected side effects on other queries.
The second way to influence the result is to change the data structure.
The third way, most feasible is to influence the execution plan by providing a hint. For this lets assume an index is present on mac and lastTime so that the db engine only needs to load this index to do its job:
CREATE INDEX idx_mac_nn_1 ON logs(mac,lastTime);
The assumed to be optimized query is (so your version without the client_id column)
SELECT mac,COUNT(mac),DATE(lastTime)
FROM logs FORCE INDEX idx_mac_nn_1
GROUP BY mac,DATE(lastTime);
This then should force MySQL to use the index no matter what.
For this query:
SELECT mac, COUNT(mac), DATE(lastTime)
FROM logs
WHERE client_id = 1
GROUP BY mac, DATE(lastTime)
You want an index on (client_id, mac, lastTime). I would suggest a covering index, if you don't mind the extra space required.

Mysql group by not using index

Given:
CREATE TABLE `APPLICATION_DEVICE_PUSHINFO` (
`applicationId` bigint(20) NOT NULL,
`deviceId` bigint(20) NOT NULL,
`active` bit(1) NOT NULL,
`inactiveAsOf` datetime DEFAULT NULL,
`lastSentOn` datetime DEFAULT NULL,
`registeredOn` datetime DEFAULT NULL,
`target` int(11) DEFAULT NULL,
`token` varchar(4096) NOT NULL,
PRIMARY KEY (`applicationId`,`deviceId`),
KEY `FKE7F2D58285EFFEAA_idx` (`deviceId`),
KEY `index3` (`token`(255)) USING BTREE,
CONSTRAINT `FKE7F2D58285EFFEAA` FOREIGN KEY (`deviceId`) REFERENCES `DEVICES` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
If I execute the following query:
explain SELECT token FROM APPLICATION_DEVICE_PUSHINFO group by token having count(deviceId) > 1;;
I get:
'1', 'SIMPLE', 'APPLICATION_DEVICE_PUSHINFO', 'ALL', NULL, NULL, NULL, NULL, '7', 'Using temporary; Using filesort'
The null values belongs to possible keys etc.
Why the index for column token is not used ?
As you don't have a WHERE clause, the query needs to process all rows (note that the HAVING clause is applied after the GROUP BY—hence it doesn't limit the row to be processed, just those that are returned).
If you need to touch all rows anyways, it's hard to gain any benefit from an index. Nevertheless it is possible to gain something if you'r able to do an Index Only Scan (IOS) and/or benefit from the pre-orderd data on disk.
However, an IOS might (not sure if MySQL considers the NOT NULL constraint) be prevented by the fact that you access the deviceId column which is not included in the index that could possibly used for this query (index3). Note that you need to have ONE index that covers all needs of the query to get an index only scan. However, if MySQL is smart enough and recognizes the NOT NULL constraint, this shouldn't be an issue. Otherwise, rewrite your query. e.g count(*) > 1.
In this particular case, your chances to get an IOS as bad anyways, because of the small size of the table (at least according to the optimizers estimates) (as already mentioned by Strawberry).
If you need to make sure it works with more rows as well, just fill up the table and see if it changes the execution plan. If not, change the query as mentioned above, try again. If not, return here and we'll see (post new execution plan).
You desire to execute this query via index is in principle reasonable. Making it work is another story :(

Optimising a slow MySQL query

I have a MySQL query as follows:
SELECT KeywordText, SUM(Frequency) AS Frequency FROM Keyword, Keyword_Polling_Frequency_Index
WHERE Keyword.KeywordText
IN ('deal', 'obama' and other keywords...)
AND RSSFeedNo IN (106, 107 and other RSS feeds)
AND PollingDateTime
BETWEEN '2011-10-28 13:00:00' AND '2011-10-28 13:59:00'
AND Keyword.KeywordNo = Keyword_Polling_Frequency_Index.KeywordNo
GROUP BY Keyword.KeywordText
ORDER BY Keyword.KeywordText ASC
The query is used by an hourly batch program which involves two tables and is meant to get the frequencies of a list of keywords from a list of RSS feeds for a given hour. The Keyword_Polling_Frequency_Index table has a composite primary key of KeywordNo, RSSFeedNo and PollingDateTime. The query joins this table to the Keyword table which contains the KeywordText. column keywordText has a MySQL MyISAM full text index.
In testing this was found to perform satisfactorily but has now started running very slowly and affects the interactive speed of pages of the application. When I check the MySQL logs, I find that MySQL is creating temporary tables.
So, my question is, given that this query has to handle dozens of keywords in dozens of RSS feeds to calculate the frequencies, can anyone suggest an optimisation?
I have thought of breaking the query up by keyword but am not convinced of the practicality of this.
Can anyone help?
I am using MySQL Community Edition 5.X and an EXTENDED EXPLAIN of a version of this query is shown above.
SQL for the tables is as follows:
CREATE TABLE `keyword` (
`KeywordNo` int(10) unsigned NOT NULL AUTO_INCREMENT,
`KeywordText` varchar(64) NOT NULL,
`UserOriginated` enum('TRUE','FALSE') NOT NULL,
`Active` enum('TRUE','FALSE') NOT NULL,
`UserNo` varchar(50) NOT NULL,
`StopWord` enum('TRUE','FALSE') NOT NULL,
`CreatedDate` date NOT NULL,
`CreatedTime` time NOT NULL,
PRIMARY KEY (`KeywordNo`),
FULLTEXT KEY `KEYWORDTEXT` (`KeywordText`)
) ENGINE=MyISAM AUTO_INCREMENT=44047 DEFAULT CHARSET=latin1$$
CREATE TABLE `keyword_polling_frequency_index` (
`KeywordNo` int(10) unsigned NOT NULL,
`RSSFeedNo` int(10) unsigned NOT NULL,
`PollingDateTime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`Frequency` int(10) NOT NULL,
`Active` enum('TRUE','FALSE') NOT NULL,
`UserNo` varchar(50) NOT NULL,
PRIMARY KEY (`KeywordNo`,`RSSFeedNo`,`PollingDateTime`),
KEY `FK_keyword_polling_frequency_index_1` (`UserNo`),
CONSTRAINT `FK_keyword_polling_frequency_index_1` FOREIGN KEY (`UserNo`) REFERENCES `user` (`UserNo`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=latin1$$
As mentioned previously, add an index to the PollingDateTime field in the order mentioned as well. This is my suggestion:
SELECT
K.KeywordText,
SUM(F.Frequency) AS Frequency
FROM
Keyword K, Keyword_Polling_Frequency_Index F
WHERE
EXISTS
(
SELECT 1
FROM Keyword K1
WHERE
MATCH K1.KeywordText AGAINST ('deal obama "another keyword" yetanother' IN BOOLEAN MODE)
AND K1.KeywordNo = K.KeywordNo
)
AND K.KeywordNo = F.KeywordNo
AND F.PollingDateTime BETWEEN '2011-10-28 13:00:00' AND '2011-10-28 13:59:00'
AND F.RSSFeedNo IN (106, 107, 110)
GROUP BY K.KeywordText
ORDER BY K.KeywordText ASC
This will probably reduce the number of records for the comparison (SQL inside-out parsing) instead of directly matching two tables (N x N).
If you don't have any indexes you should create relevant indexes.
The minimum index is on keyword_polling_frequency_index.PollingDateTime