restriction on group by column - mysql

I have the following tables and query in MySQL:
CREATE TABLE IF NOT EXISTS `events` (
`pv_name` varchar(60) COLLATE utf8mb4_unicode_ci NOT NULL,
`time_stamp` bigint(20) unsigned NOT NULL,
`event_type` varchar(40) COLLATE utf8mb4_unicode_ci NOT NULL,
`data` json,
PRIMARY KEY (`pv_name`,`time_stamp`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=COMPRESSED;
CREATE TEMPORARY TABLE matching_pv_names (
pv_name varchar(60) NOT NULL,
PRIMARY KEY (pv_name)
) ENGINE=Memory;
SELECT events.pv_name, MAX(events.time_stamp) AS time_stamp
FROM events
WHERE events.time_stamp <= #time_stamp_in
GROUP BY events.pv_name;
The query as it stands runs efficiently with 'Using index for group-by'. Is it possible to modify it to restrict the set of pv_names it groups on to those in the matching_pv_names table and still keep the 'Using index for group-by' optimization? For example, the following query no longer uses this optimization:
SELECT events.pv_name, MAX(events.time_stamp) AS time_stamp
FROM events
WHERE events.time_stamp <= #time_stamp_in
AND events.pv_name IN (SELECT matching_pv_names.pv_name FROM matching_pv_names)
GROUP BY events.pv_name;
Is there another way to write it so that it does?

Your first SQL can benefit from GROUP BY optimization because it uses one table only and the column that you use for GROUP BY has index on it and the only aggregate function you use is MAX(). and you use constant in your WHERE clause.
Once you add another table in the query then GROUP BY optimization cannot be applied.

You have asked about a specific optimization, but isn't the real question about efficiency?
See how well this works:
SELECT e.pv_name, MAX(e.time_stamp) AS time_stamp
FROM events AS e
JOIN matching_pv_names AS m USING(pv_name)
WHERE e.time_stamp <= #time_stamp_in
GROUP BY e.pv_name;
One way to compare the efficiency of two queries, even when the tables are small, is
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
Historically, this construct has optimized poorly: IN ( SELECT ... ). (I don't know whether it is working poorly for your query in your version.)

Related

How to optimize an UPDATE and JOIN query on practically identical tables?

I am trying to update one table based on another in the most efficient way.
Here is the table DDL of what I am trying to update
Table1
CREATE TABLE `customersPrimary` (
`id` int NOT NULL AUTO_INCREMENT,
`groupID` int NOT NULL,
`IDInGroup` int NOT NULL,
`name` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`address` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `groupID-IDInGroup` (`groupID`,`IDInGroup`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
Table2
CREATE TABLE `customersSecondary` (
`groupID` int NOT NULL,
`IDInGroup` int NOT NULL,
`name` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`address` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
PRIMARY KEY (`groupID`,`IDInGroup`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
Both the tables are practically identical but customersSecondary table is a staging table for the other by design. The big difference is primary keys. Table 1 has an auto incrementing primary key, table 2 has a composite primary key.
In both tables the combination of groupID and IDInGroup are unique.
Here is the query I want to optimize
UPDATE customersPrimary
INNER JOIN customersSecondary ON
(customersPrimary.groupID = customersSecondary.groupID
AND customersPrimary.IDInGroup = customersSecondary.IDInGroup)
SET
customersPrimary.name = customersSecondary.name,
customersPrimary.address = customersSecondary.address
This query works but scans EVERY row in customersSecondary.
Adding
WHERE customersPrimary.groupID = (groupID)
Cuts it down significantly to the number of rows with the GroupID in customersSecondary. But this is still often far larger than the number of rows being updated since the groupID can be large. I think the WHERE needs improvement.
I can control table structure and add indexes. I will have to keep both tables.
Any suggestions would be helpful.
Your existing query requires a full table scan because you are saying update everything on the left based on the value on the right. Presumably the optimiser is choosing customersSecondary because it has fewer rows, or at least it thinks it has.
Is the full table scan causing you problems? Locking? Too slow? How long does it take? How frequently are the tables synced? How many records are there in each table? What is the rate of change in each of the tables?
You could add separate indices on name and address but that will take a good chunk of space. The better option is going to be to add an indexed updatedAt column and use that to track which records have been changed.
ALTER TABLE `customersPrimary`
ADD COLUMN `updatedAt` DATETIME NOT NULL DEFAULT '2000-01-01 00:00:00',
ADD INDEX `idx_customer_primary_updated` (`updatedAt`);
ALTER TABLE `customersSecondary`
ADD COLUMN `updatedAt` DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
ADD INDEX `idx_customer_secondary_updated` (`updatedAt`);
And then you can add updatedAt to your join criteria and the WHERE clause -
UPDATE customersPrimary cp
INNER JOIN customersSecondary cs
ON cp.groupID = cs.groupID
AND cp.IDInGroup = cs.IDInGroup
AND cp.updatedAt < cs.updatedAt
SET
cp.name = cs.name,
cp.address = cs.address,
cp.updatedAt = cs.updatedAt
WHERE cs.updatedAt > :last_query_run_time;
For :last_query_run_time you could use the last run time if you are storing it. Otherwise, if you know you are running the query every hour you could use NOW() - INTERVAL 65 MINUTE. Notice I have used more than one hour to make sure records aren't missed if there is a slight delay for some reason. Another option would be to use SELECT MAX(updatedAt) FROM customersPrimary -
UPDATE customersPrimary cp
INNER JOIN (SELECT MAX(updatedAt) maxUpdatedAt FROM customersPrimary) t
INNER JOIN customersSecondary cs
ON cp.groupID = cs.groupID
AND cp.IDInGroup = cs.IDInGroup
AND cp.updatedAt < cs.updatedAt
SET
cp.name = cs.name,
cp.address = cs.address,
cp.updatedAt = cs.updatedAt
WHERE cs.updatedAt > t.maxUpdatedAt;
Plan A:
Something like this would first find the "new" rows, then add only those:
UPDATE primary
SET ...
JOIN ( SELECT ...
FROM secondary
LEFT JOIN primary
WHERE primary... IS NULL )
ON ...
Might secondary have changes? If so, a variant of that would work.
Plan B:
Better yet is to TRUNCATE TABLE secondary after it is folded into primary.

MySQL Query/Table in need of optimization

I have a query that is taking an embarrassingly long time. ~7 minutes embarrassing. I would really appreciate some help. Missing indexes? Rewrite the query? All of the above?
Many thanks
mysql Ver 14.14 Distrib 5.7.25, for Linux (x86_64)
The query looks like:
SELECT COUNT(*) AS count_all, name
FROM api_events ae
INNER JOIN products p on p.token=ae.product_token
WHERE (ae.created_at > '2019-01-21 12:16:53.853732')
GROUP BY name
Here are the two table definitions
api_events has ~31 million records
CREATE TABLE `api_events` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`api_name` varchar(200) NOT NULL,
`hostname` varchar(200) NOT NULL,
`controller_action` varchar(2000) NOT NULL,
`duration` decimal(12,5) NOT NULL DEFAULT '0.00000',
`view` decimal(12,5) NOT NULL DEFAULT '0.00000',
`db` decimal(12,5) NOT NULL DEFAULT '0.00000',
`created_at` datetime NOT NULL,
`updated_at` datetime NOT NULL,
`product_token` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `product_token` (`product_token`)
) ENGINE=InnoDB AUTO_INCREMENT=64851218 DEFAULT CHARSET=latin1;
and
products has only 12 records
CREATE TABLE `products` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`code` varchar(30) NOT NULL,
`name` varchar(100) NOT NULL,
`description` varchar(2000) NOT NULL,
`token` varchar(50) NOT NULL,
`created_at` datetime NOT NULL,
`updated_at` datetime NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=19 DEFAULT CHARSET=latin1;
You could improve the join performance adding index
create index idx1 on api_events(product_token, created_at);
create index idx2 on products(token);
You could also trying inverting the columns ofr api_events
create index idx1 on api_events(created_at, product_token);
and trying add redundancy to product index
create index idx2 on products(token, name);
For the query as stated, you needed
api_events: INDEX(created_at, product_token)
products: INDEX(token, name)
Because the WHERE mentions api_events, the Optimizer is likely to start with that table. created_at is in the WHERE, so the index starts with that, even though starting with a 'range' is usually wrong. In this case, the pair is "covering".
Then, INDEX(token, name) is also "covering".
"Covering" indexes give a small, but widely varying, amount of performance improvement.
What happens if you group by the token instead of the name?
SELECT ae.product_token, COUNT(*) AS count_all
FROM api_events ae
WHERE ae.created_at > '2019-01-21 12:16:53.853732')
GROUP BY ae.product_token;
For this query, an index on api_events(created_at, product_token) will probably help.
If this is faster, then you can bring in the name using a subquery.
It seems like the criteria on created_at is very selective (looking at only the past 7 days?). That's crying out to explore an index with created_at as a leading column.
The query is also referencing the product_token column from the same table, so we can include that column in the index, to make it a covering index.
api_events_IX ON api_events ( created_at, product_token )
Using that index, we can probably avoid looking at the vast majority of the 31 million rows, and quickly narrow in on the subset of rows we actually need to look at.
Using the index, the query will still need a "Using filesort" operation to satisfy the GROUP BY.
(My guess here is that the join to the 12 rows in product doesn't exclude a lot of rows... that on the vast majority of rows in api_event the product_token refers to a row that exists in product.
Use MySQL EXPLAIN to see the query execution plan.
A further possible refinement (to test the performance of) would be to do some of the aggregation in an inline view:
SELECT SUM(s.count_all) AS count_all
, p.name
FROM ( SELECT COUNT(*) AS count_all
, ae.product_token
FROM api_events ae
WHERE ae.created_at > '2019-01-21 12:16:53.853732'
GROUP
BY ae.product_token
) s
JOIN products p
ON p.token = s.product_token
GROUP
BY p.name
If the assumption about product_token is misinformed, if there are lots of rows in api_event that have product_token values that don't reference a row in product ... we might take a different tack ...

MySQL group by query with subselect optimization

I have the following tables in MySQL:
CREATE TABLE `events` (
`pv_name` varchar(60) COLLATE utf8mb4_unicode_ci NOT NULL,
`time_stamp` bigint(20) unsigned NOT NULL,
`event_type` varchar(40) COLLATE utf8mb4_unicode_ci NOT NULL,
`value` text CHARACTER SET utf8mb4 COLLATE utf8mb4_bin,
`value_type` varchar(40) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`value_count` bigint(20) DEFAULT NULL,
`alarm_status` varchar(40) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`alarm_severity` varchar(40) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
PRIMARY KEY (`pv_name`,`time_stamp`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=COMPRESSED;
CREATE TEMPORARY TABLE `matching_pv_names` (
`pv_name` varchar(60) NOT NULL,
PRIMARY KEY (`pv_name`)
) ENGINE=Memory DEFAULT CHARSET=latin1;
The matching_pv_names table holds a subset of the unique events.pv_name values.
The following query runs using the 'loose index scan' optimization:
SELECT events.pv_name, MAX(events.time_stamp) AS time_stamp
FROM events
WHERE events.time_stamp <= time_stamp_in
GROUP BY events.pv_name;
Is it possible to improve the time of this query by restricting the events.pv_name values to those in the matching_pv_names table without losing the 'loose index scan' optimization?
Try one of the below queries to limit your output to matching values found in matching_pv_names.
Query 1:
SELECT e.pv_name, MAX(e.time_stamp) AS time_stamp
FROM events e
INNER JOIN matching_pv_names pv ON e.pv_name = pv.pv_name
WHERE e.time_stamp <= time_stamp_in
GROUP BY e.pv_name;
Query 2:
SELECT e.pv_name, MAX(e.time_stamp) AS time_stamp
FROM events e
WHERE e.time_stamp <= time_stamp_in
AND EXISTS ( select 1 from matching_pv_names pv WHERE e.pv_name = pv.pv_name )
GROUP BY e.pv_name;
Let me quote manual here, since I think it applies to your case (bold emphasis mine):
If the WHERE clause contains range predicates (...), a loose index scan looks up the first key of each group
that satisfies the range conditions, and again reads the least
possible number of keys. This is possible under the following
conditions:
The query is over a single table.
Knowing this, I believe Query 1 would not be able to use a loose index scan, but probably second query could do that. If that is still not the case, you could also give a try for third method proposed which uses a derived table.
Query 3:
SELECT e.*
FROM (
SELECT e.pv_name, MAX(e.time_stamp) AS time_stamp
FROM events e
WHERE e.time_stamp <= time_stamp_in
GROUP BY e.pv_name
) e
INNER JOIN matching_pv_names pv ON e.pv_name = pv.pv_name;
Your query is very efficient. You can 'prove' it by doing this:
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
Most numbers refer to "rows touched", either in the index or in the data. You will see very low numbers. If the biggest one is about the number of rows returned, that is very good. (I tried a similar query and got about 2x; I don't know why.)
With that few rows touched then either
Outputting the rows will overwhelm the run time. So, who cares about the efficiency; or
You were I/O-bound because of leapfrogging through the index (actually, the table in your case). Run it a second time; it will be fast because of caching.
The only way to speed up leapfrogging is to somehow move the desired rows next to each other. That seems unreasonable for this query.
As for playing games with another table -- Maybe. Will the JOIN significantly decrease the number of events to look at? Then Maybe. Otherwise, I say "a very efficient query is not going to get faster by adding complexity".

MySQL: best practice for querying last value before a certain date in a time series

I have the following table in MySQL:
CREATE TABLE `history` (
`id` INT(11) NOT NULL AUTO_INCREMENT,
`timestamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`code` CHAR(32) NOT NULL,
`value` FLOAT NULL DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE INDEX `timestamp_code` (`timestamp`, `code`),
INDEX `code` (`code`),
INDEX `timestamp` (`timestamp`)
) COLLATE='utf8_general_ci' ENGINE=InnoDB;
I would like to know what is the best practice in order to access the last available value before a certain date for a certain set of codes the most efficiently?
So far I came up with the following query:
SELECT h.* FROM history h
JOIN (
SELECT code, MAX(timestamp) as 'last_ts'
FROM history WHERE
timestamp < '2015-09-04 13:50:00' AND
code IN ('119813249', '12087792', '12087797',
'127012151', '131014335', '131014378',
'132757371', '15016059', '15016062',
'150250238', '153462747', '155802712',
'156974389', '162277696', '166330444',
'166483001', '167220356', '167264923',
'167867931', '172283682', '177539478',
'177583937', '177648754', '177649011',
'187532416', '189230667', '70273253',
'70342790', '79342386', '82460282',
'98693280', '98693380')
GROUP BY code) last_price
ON last_price.last_ts = h.timestamp
AND last_price.code = h.code
The query above works, but becomes slow as the number of entries in the table grows (100'000'000 rows).
You can download sample data to populate the table.
Create an index by code, timestamp - rather than timestamp, code. This would let mysql sort out codes before looking for the max timestamp per code - and should be much faster. Use explain for verifying that the index is used.
And if you create that index - you should not have to modify your query.

At what execution level will MySQL utilize the index for ORDER BY?

I would like to understand at what point in time will MySQL use the indexed column when using ORDER BY.
For example, the query
SELECT * FROM A
INNER JOIN B ON B.id = A.id
WHERE A.status = 1 AND A.name = 'Mike' AND A.created_on BETWEEN '2014-10-01 00:00:00' AND NOW()
ORDER BY A.accessed_on DESC
Based on my knowledge a good index for the above query is an index on table A (id, status, name created_on, accessed_on) and another on B.id.
I also understand that SQL execution follow the order below. but I am not sure how the order selection and order works.
FROM clause
WHERE clause
GROUP BY clause
HAVING clause
SELECT clause
ORDER BY clause
Question
Is will it be better to start the index with the id column or in this case is does not matter since WHERE is executed first before the JOIN? or should it be
Second question the column accessed_on should it be at the beginning of the index combination, end or the middle? or should the id column come after all the columns in the WHERE clause?
I appreciate a detailed answer so I can understand the execution level of MySQL/SQL
UPDATED
I added few million records to both tables A and B then I have added multiple indexes to see which would be the best index. But, MySQL seems to like the index id_2 (ie. (status, name, created_on, id, accessed_on))
It seems to be applying the where and it will figure out that it would need and index on status, name, created_on then it apples the INNER JOIN and it will use the id index followed by the first 3. Finally, it will look for accessed_on as the last column. so the index (status, name, created_on, id, accessed_on) fits the same execution order
Here is the tables structures
CREATE TABLE `a` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`status` int(2) NOT NULL,
`name` varchar(255) NOT NULL,
`created_on` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`accessed_on` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `status` (`status`,`name`),
KEY `status_2` (`status`,`name`,`created_on`),
KEY `status_3` (`status`,`name`,`created_on`,`accessed_on`),
KEY `status_4` (`status`,`name`,`accessed_on`),
KEY `id` (`id`,`status`,`name`,`created_on`,`accessed_on`),
KEY `id_2` (`status`,`name`,`created_on`,`id`,`accessed_on`)
) ENGINE=InnoDB AUTO_INCREMENT=3135750 DEFAULT CHARSET=utf8
CREATE TABLE `b` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(255) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3012644 DEFAULT CHARSET=utf8
The best indexes for this query is: A(status, name, created_on) and B(id). These indexes will satisfy the where clause and use the index for the join to B.
This index will not be used for sorting. There are two major impediments to using any index for sorting. The first is the join. The second is the non-equality on created_on. Some databases might figure out to use an index on A(status, name, accessed_on), but I don't think MySQL is smart enough for that.
You don't want id as the first column in the index. This precludes using the index to filter on A, because id is used for the join rather than in the where.