How to avoid full table scan - mysql

I have a MYSQL database around 50GB size with millions of rows. Here is my table structure
CREATE TABLE `logs` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`mac` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`firstTime` datetime DEFAULT NULL,
`lastTime` datetime DEFAULT NULL,
`locid` int(11) DEFAULT NULL,
`client_id` int(11) DEFAULT NULL,
`created_at` datetime NOT NULL,
`updated_at` datetime NOT NULL,
`isOut` tinyint(1) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_logs_on_location_id` (`location_id`),
KEY `index_logs_on_client_id` (`client_id`),
KEY `macID` (`macID`)
) ENGINE=InnoDB AUTO_INCREMENT=39537721 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
I was looking ways to avoid full table scans. I tried to add index for mac column. However when I run EXPLAIN on my queries, possible_keys and keys are always NULL when I don't use client_id in WHERE clause, otherwise my only used index is client_id or location_id which doesn't have a significant effect on my queries in the sense of execution time. I mainly use these types of queries(grouping,sorting etc..)
SELECT mac,COUNT(mac),DATE(lastTime)
FROM logs
WHERE client_id = 1
GROUP BY mac,DATE(lastTime)
When you consider this type of table structure, how can I optimize my table to execute queries faster? I'm open to all suggestions. Thank you

To get MySQL (or Oracle, SQL Server, Postgres, MariaDB, DB2 and others) to use an index depends on how unique is the data in the mac column and how the distribution of the uniqueness is. The database engines mentioned use a cost based optimizer which estimates the cost of a certain solution and execute the solution with the lowest cost. Sometimes they are incorrect. This estimate can be influenced by playing with database parameters, however this can have unexpected side effects on other queries.
The second way to influence the result is to change the data structure.
The third way, most feasible is to influence the execution plan by providing a hint. For this lets assume an index is present on mac and lastTime so that the db engine only needs to load this index to do its job:
CREATE INDEX idx_mac_nn_1 ON logs(mac,lastTime);
The assumed to be optimized query is (so your version without the client_id column)
SELECT mac,COUNT(mac),DATE(lastTime)
FROM logs FORCE INDEX idx_mac_nn_1
GROUP BY mac,DATE(lastTime);
This then should force MySQL to use the index no matter what.

For this query:
SELECT mac, COUNT(mac), DATE(lastTime)
FROM logs
WHERE client_id = 1
GROUP BY mac, DATE(lastTime)
You want an index on (client_id, mac, lastTime). I would suggest a covering index, if you don't mind the extra space required.

Related

MySQL composite index effect on joins

I have the following SQL query (DB is MySQL 5):
select
event.full_session_id,
DATE(min(event.date)),
event_exe.user_id,
COUNT(DISTINCT event_pat.user_id)
FROM
event AS event
JOIN event_participant AS event_pat ON
event.pat_id = event_pat.id
JOIN event_participant AS event_exe on
event.exe_id = event_exe.id
WHERE
event_pat.user_id <> event_exe.user_id
GROUP BY
event.full_session_id;
"SHOW CREATE TABLE event":
CREATE TABLE `event` (
`id` int(12) NOT NULL AUTO_INCREMENT,
`date` datetime NOT NULL,
`session_id` varchar(64) DEFAULT NULL,
`full_session_id` varchar(72) DEFAULT NULL,
`pat_id` int(12) DEFAULT NULL,
`exe_id` int(12) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `SESSION_IDX` (`full_session_id`),
KEY `PAT_ID_IDX` (`pat_id`),
KEY `DATE_IDX` (`date`),
KEY `SESSLOGPATEXEC_IDX` (`full_session_id`,`date`,`pat_id`,`exe_id`)
) ENGINE=MyISAM AUTO_INCREMENT=371955 DEFAULT CHARSET=utf8
"SHOW CREATE TABLE event_participant":
CREATE TABLE `event_participant` (
`id` int(12) NOT NULL AUTO_INCREMENT,
`user_id` varchar(64) NOT NULL,
`alt_user_id` varchar(64) NOT NULL,
`username` varchar(128) NOT NULL,
`usertype` varchar(32) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `ALL_UNQ` (`user_id`,`alt_user_id`,`username`,`usertype`),
KEY `USER_ID_IDX` (`user_id`)
) ENGINE=MyISAM AUTO_INCREMENT=5397 DEFAULT CHARSET=utf8
Also, the query itself seems ugly, but this is legacy code on a production system, so we are not expected to change it (at least for now).
The problem is that, there is around 36 million record on the event table (in the production system), so there have been frequent crashes of the DB machine due to using temporary;using filesort processing (they provided these EXPLAIN outputs, unfortunately, I don't have them right now. I'll try to update them to this post later.)
The customer asks for a "quick fix" by adding indices. Currently we have indices on full_session_id, pat_id, date (separately) on event and user_id on event_participant.
Thus I'm thinking of creating a composite index (pat_id, exe_id, full_session_id, date) on event- this index comprises of the fields in the join (equivalent to where ?), then group by, then aggregate (min) parts.
This is just an idea because we currently don't have that kind of data volume to test, so we try the best we could first.
My question is:
Could the index above help in the performance ? (It's quite confusing on the effect because I have found two really contrasting results: https://dba.stackexchange.com/questions/158385/compound-index-on-inner-join-table
versus Separate Join clause in a Composite Index, where the latter suggests that composite index on joins won't work and the former that it'll work.
Does this path (adding indices) have hopes ? Or should we forget it and just try to optimize the query instead ?
Thanks in advance for your help :)
Update:
I have updated the full table description for the two related tables.
MySQL version is 5.1.69. But I think we don't need to worry about the ambiguous data issue mentioned in the comments, because it seems there won't be ambiguity for our data. Specifically, for each full_session_id, there is only one "event_exe.user_id" returned (it's just a business logic in the application)
So, what do you think about my 2 questions ?

Mysql partitioning effect on DDL and DML

I am using Mysql 5.6 with ~150 million records in Transaction table (InnodB). As the size is increasing this table is becoming unmanageable (adding column or index) and slow even with required indexing. After searching through internet I found it is appropriate time to partition the table. I am confidant that partitioning will solve following purpose for me
Improve DML statements response time (using partitioning pruning)
Improve archival process
But I am not sure wether (and how) it will improve DDL performance for this table or not. More specifically following DDL's performance.
ALTER TABLE ADD/DROP COLUMN
ALTER TABLE ADD/DROP INDEX
I went through Mysql documentation and internet but unable to find my answer. Can anyone please help me in this or provide any relevant documentation for this.
My table structure is as following
CREATE TABLE `TRANSACTION` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`parent_id` int(11) DEFAULT NULL,
`parent_uuid` char(36) DEFAULT NULL,
`order_number` varchar(64) DEFAULT NULL,
`order_id` int(11) DEFAULT NULL,
`order_uuid` char(36) DEFAULT NULL,
`order_type` char(1) DEFAULT NULL,
`business_id` int(11) DEFAULT NULL,
`store_id` int(11) DEFAULT NULL,
`store_device_id` int(11) DEFAULT NULL,
`source` char(1) DEFAULT NULL COMMENT 'instore, online, order_ahead, etc',
`created_at` timestamp NULL DEFAULT NULL,
`updated_at` timestamp NULL DEFAULT NULL,
`flags` int(11) DEFAULT NULL,
`customer_lang` char(2) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `parent_id` (`parent_id`),
KEY `business_id` (`business_id`,`store_id`,`store_device_id`),
KEY `parent_uuid` (`parent_uuid`),
KEY `order_uuid` (`order_uuid`),
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
And I am partitioning using following statement.
ALTER TABLE TRANSACTION PARTITION BY RANGE (id)
(PARTITION p0 VALUES LESS THAN (5000000) ENGINE = InnoDB,
PARTITION p1 VALUES LESS THAN (10000000) ENGINE = InnoDB,
PARTITION p2 VALUES LESS THAN MAXVALUE ENGINE = InnoDB)
Thanks!
Partitioning is not a performance panacea. Even the items you mentioned will not speed up; they may even slow down.
Instead, I will critique the table to look for ways to speed up some things.
UUIDs are terrible for performance once the index on it becomes too big to be cached. This is because of its randomness. Possible solutions: compact it into BINARY(16); shrink the table other ways; avoid UUIDs.
Why have both parent_id and parent_uuid??
Shrink the 4-byte INTs to smaller datatypes where practical.
Usually CHAR should be CHARACTER SET ascii (1-byte/character), not utf8mb4 (4 bytes/char).
Caution: 150M is getting remotely close to the 2-billion limit of INT SIGNED. Consider 4B limit of INT UNSIGNED. (Each is 4 bytes.)
Do you ever use created_at or updated_at?
MySQL 8.0.13 has a very fast ADD COLUMN and DROP COLUMN (for limited situations).
5.7.?? has a less-invasive ADD INDEX than previous versions, but I am not sure it applies to partitioned tables.
5.7.4: Online DDL support reduces table rebuild time and permits concurrent DML, which helps reduce user application downtime. For additional information, see Overview of Online DDL.
More importantly, let's see the main queries that are "too slow". There may be composite indexes and/or reformulations of the queries that will speed them up.
There is even a slim chance that partitioning will help but not on the PRIMARY KEY.
I think there are only 4 use cases where partitioning helps performance.

why mysql still use index to get data when use the 2nd col of multiple column index in mysql?

Why mysql still use index to get data when use the 2nd col of multiple column index in mysql?
We know mysql use leftmost match rule, but here I didn't use the 1st col and I use the 2nd col, the two select operation results bellow show that mysql sometimes use index and sometimes didn't use it. Why? In addtion, my mysql version is 5.6.17.
1.create table:
CREATE TABLE `student` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
`cid` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `name_cid_INX` (`name`,`cid`)
) ENGINE=InnoDB AUTO_INCREMENT=101 DEFAULT CHARSET=utf8
2.run select:
EXPLAIN SELECT * FROM student WHERE cid=1;
3. result:
Result with index
It shows that mysql use index to get data.
The following is another table.
1.create table:
CREATE TABLE `test_table` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(45) DEFAULT NULL,
`birthday` datetime DEFAULT NULL,
`address` varchar(45) DEFAULT NULL,
`phone` varchar(45) DEFAULT NULL,
`note` varchar(45) DEFAULT NULL,
`age` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `NAME` (`name`),
KEY `AGE` (`age`),
KEY `LeftMostPreFix` (`name`,`address`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
2.run select:
explain SELECT * FROM test.test_table where address = '东京'
3.result:
Result without index
On the contrary here it shows that mysql didn't use index to get data.
Comparing above two results, I feel puzzled why the 1st result use index which is against leftmost match rule.
From the mysql manual
it is possible that key will name an index that is not present in the possible_keys value. This can happen if none of the possible_keys indexes are suitable for looking up rows, but all the columns selected by the query are columns of some other index. That is, the named index covers the selected columns, so although it is not used to determine which rows to retrieve, an index scan is more efficient than a data row scan.
So while there is a key used here, it's not actually used in the normal sense. In some situations it is still more efficient to use that as a table scan (in your first example), in others it might not be (in your second)
Most of the times these things are decided by the optimizer based on several things (usage of the table, etc).
Best thing to remember is that here you can NOT "use the index", and that's why there is no index in possible keys. You can only use the index if the first column is in there.
Neither index in either Case starts with what is in the WHERE, so there will be a full scan of table or of index.
Case 1: The index is "covering", so it is a tossup as to which (table scan vs index scan) is better. The Optimizer happened to pick the secondary index. EXPLAIN FORMAT=JSON SELECT ... may have enough details to explain 'why' in this case.
Case 2: Because of * (in SELECT *), the secondary index is at a disadvantage -- it is not "covering", so the processing will bounce back and forth between the index and the data. So it is clearly better to simply scan the table.
Instead of trying to understand EXPLAIN (in these cases), turn the question around: "What is the optimal index for this query against this table?" Then follow the guidelines here.

Mysql group by not using index

Given:
CREATE TABLE `APPLICATION_DEVICE_PUSHINFO` (
`applicationId` bigint(20) NOT NULL,
`deviceId` bigint(20) NOT NULL,
`active` bit(1) NOT NULL,
`inactiveAsOf` datetime DEFAULT NULL,
`lastSentOn` datetime DEFAULT NULL,
`registeredOn` datetime DEFAULT NULL,
`target` int(11) DEFAULT NULL,
`token` varchar(4096) NOT NULL,
PRIMARY KEY (`applicationId`,`deviceId`),
KEY `FKE7F2D58285EFFEAA_idx` (`deviceId`),
KEY `index3` (`token`(255)) USING BTREE,
CONSTRAINT `FKE7F2D58285EFFEAA` FOREIGN KEY (`deviceId`) REFERENCES `DEVICES` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
If I execute the following query:
explain SELECT token FROM APPLICATION_DEVICE_PUSHINFO group by token having count(deviceId) > 1;;
I get:
'1', 'SIMPLE', 'APPLICATION_DEVICE_PUSHINFO', 'ALL', NULL, NULL, NULL, NULL, '7', 'Using temporary; Using filesort'
The null values belongs to possible keys etc.
Why the index for column token is not used ?
As you don't have a WHERE clause, the query needs to process all rows (note that the HAVING clause is applied after the GROUP BY—hence it doesn't limit the row to be processed, just those that are returned).
If you need to touch all rows anyways, it's hard to gain any benefit from an index. Nevertheless it is possible to gain something if you'r able to do an Index Only Scan (IOS) and/or benefit from the pre-orderd data on disk.
However, an IOS might (not sure if MySQL considers the NOT NULL constraint) be prevented by the fact that you access the deviceId column which is not included in the index that could possibly used for this query (index3). Note that you need to have ONE index that covers all needs of the query to get an index only scan. However, if MySQL is smart enough and recognizes the NOT NULL constraint, this shouldn't be an issue. Otherwise, rewrite your query. e.g count(*) > 1.
In this particular case, your chances to get an IOS as bad anyways, because of the small size of the table (at least according to the optimizers estimates) (as already mentioned by Strawberry).
If you need to make sure it works with more rows as well, just fill up the table and see if it changes the execution plan. If not, change the query as mentioned above, try again. If not, return here and we'll see (post new execution plan).
You desire to execute this query via index is in principle reasonable. Making it work is another story :(

Changing data organization on disk in MySQL

We have a data set that is fairly static in a MySQL database, but the read times are terrible (even with indexes on the columns being queried). The theory is that since rows are stored randomly (or sometimes in order of insertion), the disk head has to scan around to find different rows, even if it knows where they are due to the index, instead of just reading them sequentially.
Is it possible to change the order data is stored in on disk so that it can be read sequentially? Unfortunately, we can't add a ton more RAM at the moment to have all the queries cached. If it's possible to change the order, can we define an order within an order? As in, sort by a certain column, then sort by another column if the first column is equal.
Could this have something to do with the indices?
Additional details: non-relational single-table database with 16 million rows, 1 GB of data total, 512 mb RAM, MariaDB 5.5.30 on Ubuntu 12.04 with a standard hard drive. Also this is a virtualized machine using OpenVZ, 2 dedicated core E5-2620 2Ghz CPU
Create syntax:
CREATE TABLE `Events` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`provider` varchar(10) DEFAULT NULL,
`location` varchar(5) DEFAULT NULL,
`start_time` datetime DEFAULT NULL,
`end_time` datetime DEFAULT NULL,
`cost` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `provider` (`provider`),
KEY `location` (`location`),
KEY `start_time` (`start_time`),
KEY `end_time` (`end_time`),
KEY `cost` (`cost`)
) ENGINE=InnoDB AUTO_INCREMENT=16321002 DEFAULT CHARSET=utf8;
Select statement that takes a long time:
SELECT *
FROM `Events`
WHERE `Events`.start_time >= '2013-05-03 23:00:00' AND `Events`.start_time <= '2013-06-04 22:00:00' AND `FlightRoutes`.location = 'Chicago'
Explain select:
1 SIMPLE Events ref location,start_time location 18 const 3684 Using index condition; Using where
MySQL can only select one index upon which to filter (which makes sense, because having restricted the results using an index it cannot then determine how such restriction has affected other indices). Therefore, it tracks the cardinality of each index and chooses the one that is likely to be the most selective (i.e. has the highest cardinality): in this case, it has chosen the location index, but that will typically leave 3,684 records that must be fetched and then filtered Using where to find those that match the desired range of start_time.
You should try creating a composite index over (location, start_time):
ALTER TABLE Events ADD INDEX (location, start_time)