MySQL group by query with subselect optimization


I have the following tables in MySQL:
CREATE TABLE `events` (
`pv_name` varchar(60) COLLATE utf8mb4_unicode_ci NOT NULL,
`time_stamp` bigint(20) unsigned NOT NULL,
`event_type` varchar(40) COLLATE utf8mb4_unicode_ci NOT NULL,
`value` text CHARACTER SET utf8mb4 COLLATE utf8mb4_bin,
`value_type` varchar(40) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`value_count` bigint(20) DEFAULT NULL,
`alarm_status` varchar(40) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`alarm_severity` varchar(40) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
PRIMARY KEY (`pv_name`,`time_stamp`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=COMPRESSED;
CREATE TEMPORARY TABLE `matching_pv_names` (
`pv_name` varchar(60) NOT NULL,
PRIMARY KEY (`pv_name`)
) ENGINE=Memory DEFAULT CHARSET=latin1;
The matching_pv_names table holds a subset of the unique events.pv_name values.
The following query runs using the 'loose index scan' optimization:
SELECT events.pv_name, MAX(events.time_stamp) AS time_stamp
FROM events
WHERE events.time_stamp <= time_stamp_in
GROUP BY events.pv_name;
Is it possible to improve the time of this query by restricting the events.pv_name values to those in the matching_pv_names table without losing the 'loose index scan' optimization?

Try one of the below queries to limit your output to matching values found in matching_pv_names.
Query 1:
SELECT e.pv_name, MAX(e.time_stamp) AS time_stamp
FROM events e
INNER JOIN matching_pv_names pv ON e.pv_name = pv.pv_name
WHERE e.time_stamp <= time_stamp_in
GROUP BY e.pv_name;
Query 2:
SELECT e.pv_name, MAX(e.time_stamp) AS time_stamp
FROM events e
WHERE e.time_stamp <= time_stamp_in
AND EXISTS ( select 1 from matching_pv_names pv WHERE e.pv_name = pv.pv_name )
GROUP BY e.pv_name;
Let me quote the manual here, since I think it applies to your case (bold emphasis mine):
If the WHERE clause contains range predicates (...), a loose index scan looks up the first key of each group
that satisfies the range conditions, and again reads the least
possible number of keys. This is possible under the following
conditions:
The query is over a single table.
Knowing this, I believe Query 1 cannot use a loose index scan, but Query 2 probably can. If that is still not the case, you could also try Query 3, which uses a derived table.
Query 3:
SELECT e.*
FROM (
SELECT e.pv_name, MAX(e.time_stamp) AS time_stamp
FROM events e
WHERE e.time_stamp <= time_stamp_in
GROUP BY e.pv_name
) e
INNER JOIN matching_pv_names pv ON e.pv_name = pv.pv_name;
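Before comparing plans, it may help to confirm that the three rewrites return the same rows. The following is a minimal sketch, assuming Python's sqlite3 module as a stand-in for MySQL; the sample rows, the simplified two-column events table, and the time_stamp_in value are invented for illustration. It says nothing about which plan MySQL picks; for that, run EXPLAIN on the real server.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (
    pv_name    TEXT    NOT NULL,
    time_stamp INTEGER NOT NULL,
    PRIMARY KEY (pv_name, time_stamp)
);
CREATE TABLE matching_pv_names (pv_name TEXT PRIMARY KEY);
-- Invented sample data.
INSERT INTO events VALUES
    ('a', 10), ('a', 20), ('a', 30),
    ('b', 15), ('b', 25),
    ('c', 5);
INSERT INTO matching_pv_names VALUES ('a'), ('c');
""")

time_stamp_in = 22  # hypothetical cutoff

# Query 1: join, then group.
query_1 = """
SELECT e.pv_name, MAX(e.time_stamp) AS time_stamp
FROM events e
INNER JOIN matching_pv_names pv ON e.pv_name = pv.pv_name
WHERE e.time_stamp <= ?
GROUP BY e.pv_name
"""

# Query 2: EXISTS filter, then group.
query_2 = """
SELECT e.pv_name, MAX(e.time_stamp) AS time_stamp
FROM events e
WHERE e.time_stamp <= ?
  AND EXISTS (SELECT 1 FROM matching_pv_names pv WHERE e.pv_name = pv.pv_name)
GROUP BY e.pv_name
"""

# Query 3: group the single table first, then join the derived table.
query_3 = """
SELECT latest.pv_name, latest.time_stamp
FROM (
    SELECT e.pv_name, MAX(e.time_stamp) AS time_stamp
    FROM events e
    WHERE e.time_stamp <= ?
    GROUP BY e.pv_name
) latest
INNER JOIN matching_pv_names pv ON latest.pv_name = pv.pv_name
"""

# Sort in Python because none of the queries guarantees an output order.
results = [
    sorted(conn.execute(q, (time_stamp_in,)).fetchall())
    for q in (query_1, query_2, query_3)
]
print(results[0])
```

Only the inner SELECT of Query 3 is a single-table query, which is why it is the candidate for keeping the loose index scan; the join is applied afterwards to the already grouped rows.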

Your query is very efficient. You can 'prove' it by doing this:
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
Most numbers refer to "rows touched", either in the index or in the data. You will see very low numbers. If the biggest one is about the number of rows returned, that is very good. (I tried a similar query and got about 2x; I don't know why.)
With that few rows touched, one of two things is true:
Outputting the rows dominates the run time, so the efficiency of the query hardly matters; or
You were I/O-bound because of leapfrogging through the index (actually, the table in your case). Run it a second time; it will be fast because of caching.
The only way to speed up leapfrogging is to somehow move the desired rows next to each other. That seems unreasonable for this query.
As for playing games with another table -- Maybe. Will the JOIN significantly decrease the number of events to look at? Then Maybe. Otherwise, I say "a very efficient query is not going to get faster by adding complexity".

Related

How to optimize an UPDATE and JOIN query on practically identical tables?

I am trying to update one table based on another in the most efficient way.
Here is the table DDL of what I am trying to update
Table1
CREATE TABLE `customersPrimary` (
`id` int NOT NULL AUTO_INCREMENT,
`groupID` int NOT NULL,
`IDInGroup` int NOT NULL,
`name` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`address` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `groupID-IDInGroup` (`groupID`,`IDInGroup`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
Table2
CREATE TABLE `customersSecondary` (
`groupID` int NOT NULL,
`IDInGroup` int NOT NULL,
`name` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`address` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
PRIMARY KEY (`groupID`,`IDInGroup`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
Both the tables are practically identical but customersSecondary table is a staging table for the other by design. The big difference is primary keys. Table 1 has an auto incrementing primary key, table 2 has a composite primary key.
In both tables the combination of groupID and IDInGroup are unique.
Here is the query I want to optimize
UPDATE customersPrimary
INNER JOIN customersSecondary ON
(customersPrimary.groupID = customersSecondary.groupID
AND customersPrimary.IDInGroup = customersSecondary.IDInGroup)
SET
customersPrimary.name = customersSecondary.name,
customersPrimary.address = customersSecondary.address
This query works but scans EVERY row in customersSecondary.
Adding
WHERE customersPrimary.groupID = (groupID)
cuts it down significantly, to the number of rows with that groupID in customersSecondary. But this is still often far larger than the number of rows being updated, since a group can be large. I think the WHERE needs improvement.
I can control table structure and add indexes. I will have to keep both tables.
Any suggestions would be helpful.

Your existing query requires a full table scan because you are saying update everything on the left based on the value on the right. Presumably the optimiser is choosing customersSecondary because it has fewer rows, or at least it thinks it has.
Is the full table scan causing you problems? Locking? Too slow? How long does it take? How frequently are the tables synced? How many records are there in each table? What is the rate of change in each of the tables?
You could add separate indices on name and address but that will take a good chunk of space. The better option is going to be to add an indexed updatedAt column and use that to track which records have been changed.
ALTER TABLE `customersPrimary`
ADD COLUMN `updatedAt` DATETIME NOT NULL DEFAULT '2000-01-01 00:00:00',
ADD INDEX `idx_customer_primary_updated` (`updatedAt`);
ALTER TABLE `customersSecondary`
ADD COLUMN `updatedAt` DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
ADD INDEX `idx_customer_secondary_updated` (`updatedAt`);
And then you can add updatedAt to your join criteria and the WHERE clause -
UPDATE customersPrimary cp
INNER JOIN customersSecondary cs
ON cp.groupID = cs.groupID
AND cp.IDInGroup = cs.IDInGroup
AND cp.updatedAt < cs.updatedAt
SET
cp.name = cs.name,
cp.address = cs.address,
cp.updatedAt = cs.updatedAt
WHERE cs.updatedAt > :last_query_run_time;
For :last_query_run_time you could use the last run time if you are storing it. Otherwise, if you know you are running the query every hour you could use NOW() - INTERVAL 65 MINUTE. Notice I have used more than one hour to make sure records aren't missed if there is a slight delay for some reason. Another option would be to use SELECT MAX(updatedAt) FROM customersPrimary -
UPDATE customersPrimary cp
INNER JOIN (SELECT MAX(updatedAt) maxUpdatedAt FROM customersPrimary) t
INNER JOIN customersSecondary cs
ON cp.groupID = cs.groupID
AND cp.IDInGroup = cs.IDInGroup
AND cp.updatedAt < cs.updatedAt
SET
cp.name = cs.name,
cp.address = cs.address,
cp.updatedAt = cs.updatedAt
WHERE cs.updatedAt > t.maxUpdatedAt;

Plan A:
Something like this would first find the "new" rows, then add only those:
UPDATE primary
JOIN ( SELECT ...
FROM secondary
LEFT JOIN primary
WHERE primary... IS NULL )
ON ...
SET ...
Might secondary have changes? If so, a variant of that would work.
Plan B:
Better yet is to TRUNCATE TABLE secondary after it is folded into primary.

Laravel - How to optimize MIN - MAX - orderBy queries?

My code in Laravel is:
Car::selectRaw('*,
MIN(car_prices.price) AS min_price,
MAX(car_prices.price) AS max_price,
MAX(car_prices.updated_at) AS latest_update')
->leftJoin('car_prices', 'car_prices.car_id', 'cars.id')
->groupBy('car_prices.car_id')
->orderBy('latest_update', 'desc')
->paginate(10);
It runs for a long time and then fails with this error:
Maximum execution time of 60 seconds exceeded
The count of records in cars table is 100,000 and 6,000,000 in car_prices.
The tables structure:
CREATE TABLE `cars` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(191) COLLATE utf8mb4_unicode_ci NOT NULL,
`created_at` timestamp NULL DEFAULT NULL,
`updated_at` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=110001 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
CREATE TABLE `car_prices` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`car_id` bigint(20) unsigned NOT NULL,
`price` decimal(8,2) NOT NULL,
`created_at` timestamp NULL DEFAULT NULL,
`updated_at` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `car_prices_car_id_foreign` (`car_id`)
) ENGINE=MyISAM AUTO_INCREMENT=5506827 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
The query:
select count(*) as aggregate
from `cars`
left join `car_prices`
on `car_prices`.`car_id` = `cars`.`id`
group by `car_prices`.`car_id`;
select *,
MIN(car_prices.price) AS min_price,
MAX(car_prices.price) AS max_price,
MAX(car_prices.updated_at) AS latest_update from `cars`
left join `car_prices`
on `car_prices`.`car_id` = `cars`.`id`
group by `car_prices`.`car_id`
order by `latest_update` desc
limit 10
offset 0;
How can I optimize it? Should I cache the data? Or is there a better query than this?
My hard disk is an SSD.
Value of innodb_flush_log_at_trx_commit = 1
The number of writes/inserts is approximately 1000/second from 10 AM to 2 PM; before and after this period there are far fewer requests.

You need either a better index on the cars table for latest_update, or to remove ->orderBy('latest_update', 'desc') from the query and sort after receiving the results.
You can check the performance in MySQL with EXPLAIN:
EXPLAIN SELECT * FROM cars order by latest_update desc;
See https://www.exoscale.com/syslog/explaining-mysql-queries/#:~:text=the%20last%20decade.-,Explain,DELETE%20%2C%20REPLACE%20%2C%20and%20UPDATE%20.
and https://dev.mysql.com/doc/refman/5.7/en/using-explain.html#:~:text=The%20EXPLAIN%20statement%20provides%20information,%2C%20REPLACE%20%2C%20and%20UPDATE%20statements.&text=That%20is%2C%20MySQL%20explains%20how,joined%20and%20in%20which%20order.
Basically, you need to optimize your cars table (with a better index) so that it performs well.
Another thing you might try is increasing the execution time limit.
In php.ini, set max_execution_time = 600 or higher, just to check how much time the query needs to complete.
https://www.codewall.co.uk/increase-php-script-max-execution-time-limit-using-ini_set-function/

The query you have used is not suited to such large tables. Instead, whenever a row is inserted into car_prices, compute the minimum and maximum values and store them in the cars table. Alternatively, you can set up a cron job for this.
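The idea of keeping the aggregates on the cars row can be sketched roughly as follows. This is a hedged illustration using Python's sqlite3 module rather than MySQL or Laravel; the min_price, max_price and latest_update columns on cars, the index name, and the add_price helper are all hypothetical additions, and SQLite's two-argument MIN()/MAX() stand in for MySQL's LEAST()/GREATEST().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical summary columns added to cars (not in the original schema).
CREATE TABLE cars (
    id            INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    min_price     REAL,
    max_price     REAL,
    latest_update TEXT
);
CREATE TABLE car_prices (
    id         INTEGER PRIMARY KEY,
    car_id     INTEGER NOT NULL,
    price      REAL NOT NULL,
    updated_at TEXT NOT NULL
);
-- Lets the listing query read the newest cars straight from an index.
CREATE INDEX cars_latest_update ON cars (latest_update);
INSERT INTO cars (id, name) VALUES (1, 'alpha'), (2, 'beta');
""")


def add_price(car_id, price, updated_at):
    """Insert a price row and refresh the summary columns in one transaction."""
    with conn:  # commits both statements together, or rolls both back
        conn.execute(
            "INSERT INTO car_prices (car_id, price, updated_at) VALUES (?, ?, ?)",
            (car_id, price, updated_at),
        )
        # Two-argument MIN/MAX are scalar functions in SQLite;
        # in MySQL the equivalents would be LEAST/GREATEST.
        conn.execute(
            """
            UPDATE cars
            SET min_price     = MIN(COALESCE(min_price, ?), ?),
                max_price     = MAX(COALESCE(max_price, ?), ?),
                latest_update = MAX(COALESCE(latest_update, ?), ?)
            WHERE id = ?
            """,
            (price, price, price, price, updated_at, updated_at, car_id),
        )


add_price(1, 100.0, "2021-01-01 10:00:00")
add_price(1, 80.0, "2021-01-02 10:00:00")
add_price(2, 50.0, "2021-01-01 12:00:00")

# The paginated listing now touches only cars.
page = conn.execute(
    """
    SELECT name, min_price, max_price, latest_update
    FROM cars
    WHERE latest_update IS NOT NULL
    ORDER BY latest_update DESC
    LIMIT 10 OFFSET 0
    """
).fetchall()
print(page)
```

The sketch only handles inserts. If prices can be deleted or edited, the summary columns would need to be recomputed for the affected car, which is the part of the tradeoff to weigh against the faster listing query.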

In both queries, use
GROUP BY cars.id
instead of car_prices.car_id, which might be NULL because of the LEFT JOIN.
Once you have done that, the first query (with just the COUNT) can drop the JOIN. And then the GROUP BY becomes redundant:
select count(*) as aggregate
from `cars`
The second query has issues.
With the current design, you must go through all of both tables. Ugh.
Also... If there are no prices for a given car, it will have NULL for latest_update, therefore it will sort at the end of the 100,000 rows. Given that, you may as well not display those cars; this would simplify the query enough to be better optimized.
If you need to list the cars for which you have no prices, make that a separate request in the UI. That query will be a LEFT JOIN .. IS NULL and won't need the MAX()s.
But, I am still concerned about the 10,000 pages that the user needs to paginate through.
Switch from MyISAM to InnoDB.
Toss created_at and updated_at, if you aren't using them for anything.
After that, cars is simply a mapping between id and name. This might allow you to avoid going through cars. Instead do something like
SELECT ( SELECT name FROM cars WHERE id = x.car_id ) AS name,
...
FROM ...
Another thought that whenever you add a row to car_prices, you update updated_at in cars. This would allow you to find the 10 cars entirely in cars.
Decide what you are willing to sacrifice.
More
Note: With MyISAM, a slow SELECT blocks UPDATE. With InnoDB, they can run in parallel; the SELECT uses the values before the UPDATE. Either way, the select is at some "point in time". But InnoDB allows more parallelism.
It is a tradeoff. A small slowdown in updates to achieve a big speedup on selects. (No, I don't know for sure that my suggestion is "faster")
Some further questions to analyze the tradeoff:
Disk: HDD or SSD?
Value of innodb_flush_log_at_trx_commit (after you change to InnoDB).
How much traffic? As a first cut, is the number of writes--insert/delete--more than 100/second?

Concurrent queries on composite index with order by id drastically slow

I have a table defined as follows:
| book | CREATE TABLE `book` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`provider_id` int(10) unsigned DEFAULT '0',
`source_id` varchar(64) COLLATE utf8_unicode_ci DEFAULT NULL,
`title` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`description` longtext COLLATE utf8_unicode_ci,
PRIMARY KEY (`id`),
UNIQUE KEY `provider` (`provider_id`,`source_id`),
KEY `idx_source_id` (`source_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1605425 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci |
When there are about 10 concurrent reads with the following SQL:
SELECT * FROM `book` WHERE (provider_id = '1' AND source_id = '1037122800') ORDER BY `book`.`id` ASC LIMIT 1
it becomes slow; it takes about 100 ms.
However, if I change it to
SELECT * FROM `book` WHERE (provider_id = '1' AND source_id = '221630001') LIMIT 1
then it is normal; it takes several ms.
I don't understand why adding ORDER BY id makes the query so much slower. Could anyone explain?

Try selecting only the columns you need (SELECT column_name, ...) instead of *, or refer to this:
Why is my SQL Server ORDER BY slow despite the ordered column being indexed?

I'm not a mysql expert, and not able to perform a detailed analysis, but my guess would be that because you are providing values for the UNIQUE KEY in the WHERE clause, the engine can go and fetch that row directly using an index.
However, when you ask it to ORDER BY the id column, which is a PRIMARY KEY, that changes the access path. The engine now guesses that since it has an index on id, and you want to order by id, it is better to fetch that data in PK order, which will avoid a sort. In this case though, it leads to a slower result, as it has to compare every row to the criteria (a table scan).
Note that this is just conjecture. You would need to EXPLAIN both statements to see what is going on.

Efficient MySQL query for huge set of data

Say I have a table like the one below:
CREATE TABLE `hadoop_apps` (
`clusterId` smallint(5) unsigned NOT NULL,
`appId` varchar(35) COLLATE utf8_unicode_ci NOT NULL,
`user` varchar(64) COLLATE utf8_unicode_ci NOT NULL,
`queue` varchar(35) COLLATE utf8_unicode_ci NOT NULL,
`appName` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`submitTime` datetime NOT NULL COMMENT 'App submission time',
`finishTime` datetime DEFAULT NULL COMMENT 'App completion time',
`elapsedTime` int(11) DEFAULT NULL COMMENT 'App duration in milliseconds',
PRIMARY KEY (`clusterId`,`appId`,`submitTime`),
KEY `hadoop_apps_ibk_finish` (`finishTime`),
KEY `hadoop_apps_ibk_queueCluster` (`queue`,`clusterId`),
KEY `hadoop_apps_ibk_userCluster` (`user`(8),`clusterId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
mysql> SELECT COUNT(*) FROM hadoop_apps;
This returns a count of 158593816.
So I am trying to understand what is inefficient about the below query and how I can improve it.
mysql> SELECT * FROM hadoop_apps WHERE DATE(finishTime)='10-11-2013';
Also, what's the difference between these two queries?
mysql> SELECT * FROM hadoop_apps WHERE user='foobar';
mysql> SELECT * FROM hadoop_apps HAVING user='foobar';

WHERE DATE(finishTime)='10-11-2013';
This is a problem for the optimizer because anytime you put a column into a function like this, the optimizer doesn't know if the order of values returned by the function will be the same as the order of values input to the function. So it can't use an index to speed up lookups.
To solve this, refrain from putting the column inside a function call like that, if you want the lookup against that column to use an index.
Also, you should use MySQL standard date format: YYYY-MM-DD.
WHERE finishTime BETWEEN '2013-10-11 00:00:00' AND '2013-10-11 23:59:59'
What is the difference between [conditions in WHERE and HAVING clauses]?
The WHERE clause is for filtering rows.
The HAVING clause is for filtering results after applying GROUP BY.
See SQL - having VS where
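The row-versus-group distinction can be seen on a tiny example. This is a sketch only, assuming Python's sqlite3 module in place of MySQL, a cut-down two-column hadoop_apps table, and invented rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Cut-down version of hadoop_apps with invented rows.
CREATE TABLE hadoop_apps (user TEXT NOT NULL, queue TEXT NOT NULL);
INSERT INTO hadoop_apps VALUES
    ('foobar', 'default'),
    ('foobar', 'batch'),
    ('foobar', 'batch'),
    ('alice',  'default');
""")

# WHERE filters individual rows before any grouping happens.
where_rows = conn.execute(
    "SELECT user, queue FROM hadoop_apps WHERE user = 'foobar'"
).fetchall()

# HAVING filters whole groups, so it can test an aggregate such as COUNT(*).
busy_users = conn.execute(
    """
    SELECT user, COUNT(*) AS apps
    FROM hadoop_apps
    GROUP BY user
    HAVING COUNT(*) > 1
    """
).fetchall()

# Both together: rows are filtered first, then the surviving groups.
busy_batch_users = conn.execute(
    """
    SELECT user, COUNT(*) AS apps
    FROM hadoop_apps
    WHERE queue = 'batch'
    GROUP BY user
    HAVING COUNT(*) >= 2
    """
).fetchall()

print(len(where_rows), busy_users, busy_batch_users)
```

A condition on an aggregate can only go in HAVING, while a condition on a plain column such as user belongs in WHERE so that rows are discarded before grouping.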

If WHERE works, it is preferred over HAVING. The former is done earlier in the processing, thereby cutting down on the amount of data to shovel through. OK, in your one example, there may be no difference between them.
I cringe whenever I see a DATETIME in a UNIQUE key (your PK). Can't the app have two rows in the same second? Is that a risk you want to take?
Even changing to DATETIME(6) (microseconds) could be risky.
Regardless of what you do in that area, I recommend this pattern for testing:
WHERE finishTime >= '2013-10-11'
AND finishTime < '2013-10-11' + INTERVAL 1 DAY
It works "correctly" for DATE, DATETIME, and DATETIME(6), etc. Other flavors add an extra midnight or miss parts of a second. And it avoids hassles with leapdays, etc, if the interval is more than a single day.
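To illustrate why the half-open range is safer than a closed BETWEEN, here is a small sketch. It assumes Python's sqlite3 module instead of MySQL, a cut-down hadoop_apps table, and invented rows; the day bounds are computed in Python because `+ INTERVAL 1 DAY` is MySQL syntax. SQLite compares these ISO-formatted strings as text, whereas MySQL compares them as temporal values, but the boundary behaviour shown is the same.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Cut-down table; finishTime is stored as ISO-8601 text in SQLite.
CREATE TABLE hadoop_apps (appId TEXT PRIMARY KEY, finishTime TEXT);
CREATE INDEX hadoop_apps_ibk_finish ON hadoop_apps (finishTime);
INSERT INTO hadoop_apps VALUES
    ('app_1', '2013-10-10 23:59:59'),
    ('app_2', '2013-10-11 00:00:00'),
    ('app_3', '2013-10-11 23:59:59.500000'),  -- fractional second
    ('app_4', '2013-10-12 00:00:00');
""")

day = date(2013, 10, 11)
start = day.isoformat()                      # '2013-10-11'
end = (day + timedelta(days=1)).isoformat()  # '2013-10-12'

# Half-open range: includes midnight of the day, excludes the next midnight.
half_open = [
    row[0]
    for row in conn.execute(
        "SELECT appId FROM hadoop_apps "
        "WHERE finishTime >= ? AND finishTime < ? ORDER BY appId",
        (start, end),
    )
]

# Closed range ending at 23:59:59 misses the fractional-second row.
closed = [
    row[0]
    for row in conn.execute(
        "SELECT appId FROM hadoop_apps "
        "WHERE finishTime BETWEEN '2013-10-11 00:00:00' "
        "AND '2013-10-11 23:59:59' ORDER BY appId"
    )
]

print(half_open, closed)
```

Because the column appears bare on the left of both comparisons, an index on finishTime remains usable, unlike with DATE(finishTime).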
KEY `hadoop_apps_ibk_userCluster` (`user`(8),`clusterId`)
is bad. It won't get past user(8). And prefixing like that is often useless. Let's see the query that tempted you to build that key; we'll come up with a better one.
158M rows with 4 varchars. And they sound like values that don't have many distinct values? Build lookup tables and replace them with SMALLINT UNSIGNED (2 bytes, 0..64K range) or other small id. This will significantly shrink the table, thereby making it faster.

restriction on group by column

I have the following tables and query in MySQL:
CREATE TABLE IF NOT EXISTS `events` (
`pv_name` varchar(60) COLLATE utf8mb4_unicode_ci NOT NULL,
`time_stamp` bigint(20) unsigned NOT NULL,
`event_type` varchar(40) COLLATE utf8mb4_unicode_ci NOT NULL,
`data` json,
PRIMARY KEY (`pv_name`,`time_stamp`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=COMPRESSED;
CREATE TEMPORARY TABLE matching_pv_names (
pv_name varchar(60) NOT NULL,
PRIMARY KEY (pv_name)
) ENGINE=Memory;
SELECT events.pv_name, MAX(events.time_stamp) AS time_stamp
FROM events
WHERE events.time_stamp <= #time_stamp_in
GROUP BY events.pv_name;
The query as it stands runs efficiently with 'Using index for group-by'. Is it possible to modify it to restrict the set of pv_names it groups on to those in the matching_pv_names table and still keep the 'Using index for group-by' optimization? For example, the following query no longer uses this optimization:
SELECT events.pv_name, MAX(events.time_stamp) AS time_stamp
FROM events
WHERE events.time_stamp <= #time_stamp_in
AND events.pv_name IN (SELECT matching_pv_names.pv_name FROM matching_pv_names)
GROUP BY events.pv_name;
Is there another way to write it so that it does?

Your first SQL can benefit from the GROUP BY optimization because it uses only one table, the column that you use for GROUP BY has an index on it, the only aggregate function you use is MAX(), and you use a constant in your WHERE clause.
Once you add another table to the query, the GROUP BY optimization can no longer be applied.

You have asked about a specific optimization, but isn't the real question about efficiency?
See how well this works:
SELECT e.pv_name, MAX(e.time_stamp) AS time_stamp
FROM events AS e
JOIN matching_pv_names AS m USING(pv_name)
WHERE e.time_stamp <= #time_stamp_in
GROUP BY e.pv_name;
One way to compare the efficiency of two queries, even when the tables are small, is
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
Historically, this construct has optimized poorly: IN ( SELECT ... ). (I don't know whether it is working poorly for your query in your version.)
