MYSQL SUM() with GROUP BY and LIMIT

I got this table
CREATE TABLE `votes` (
`item_id` int(10) unsigned NOT NULL,
`user_id` int(10) unsigned NOT NULL,
`vote` tinyint(4) NOT NULL DEFAULT '0',
PRIMARY KEY (`item_id`,`user_id`),
KEY `FK_vote_user` (`user_id`),
KEY `vote` (`vote`),
KEY `item` (`item_id`),
CONSTRAINT `FK_vote_item` FOREIGN KEY (`item_id`) REFERENCES `items` (`id`) ON UPDATE CASCADE,
CONSTRAINT `FK_vote_user` FOREIGN KEY (`user_id`) REFERENCES `users` (`id`) ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
And I got this simple select
SELECT
`a`.`item_id`, `a`.`sum`
FROM
(SELECT
`item_id`, SUM(vote) AS `sum`
FROM
`votes`
GROUP BY `item_id`) AS a
ORDER BY `a`.`sum` DESC
LIMIT 10
Right now, with only 250 rows, there isn't a problem, but it's using filesort. The vote column holds either -1, 0 or 1. But will this be performant when the table has millions of rows?
If I make it a simpler query without the subquery, then "Using temporary" appears in the EXPLAIN output.
Explain gives (the query completes in 0.00170s):
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 33 Using filesort
2 DERIVED votes index NULL PRIMARY 8 NULL 250

No, this won't be efficient with millions of rows.
You'll have to create a supporting aggregate table which would store votes per item:
CREATE TABLE item_votes
(
item_id INT NOT NULL PRIMARY KEY,
votes INT NOT NULL,
upvotes INT UNSIGNED NOT NULL,
downvotes INT UNSIGNED NOT NULL,
KEY (votes),
KEY (upvotes),
KEY (downvotes)
)
and update it each time a vote is cast:
INSERT
INTO item_votes (item_id, votes, upvotes, downvotes)
VALUES (
$item_id,
CASE WHEN $upvote THEN 1 ELSE -1 END,
CASE WHEN $upvote THEN 1 ELSE 0 END,
CASE WHEN $upvote THEN 0 ELSE 1 END
)
ON DUPLICATE KEY
UPDATE votes = votes + VALUES(upvotes) - VALUES(downvotes),
upvotes = upvotes + VALUES(upvotes),
downvotes = downvotes + VALUES(downvotes)
then select the top 10 by votes:
SELECT *
FROM item_votes
ORDER BY
votes DESC, item_id DESC
LIMIT 10
efficiently using an index.

But will this be performant when this table has millions of rows?
No, it won't.
If I make it a simpler query without the subquery, then "Using temporary" appears in the EXPLAIN output.
Probably because the planner would turn it into the query you posted: it needs to calculate the sum to return the results in the correct order.
To quickly grab the top-voted items, you need to cache the result. Add a score field to your items table, maintain it (e.g. using triggers), and index it. You'll then be able to grab the top 10 scores using an index scan.
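A minimal sketch of that approach, assuming items has an integer id primary key (the column, index, and trigger names here are illustrative):
ALTER TABLE items ADD COLUMN score INT NOT NULL DEFAULT 0, ADD INDEX idx_score (score);
-- one-off backfill from the existing votes
UPDATE items i
SET score = (SELECT COALESCE(SUM(v.vote), 0) FROM votes v WHERE v.item_id = i.id);
-- keep the cached score in sync on every new vote
CREATE TRIGGER votes_ai AFTER INSERT ON votes
FOR EACH ROW
UPDATE items SET score = score + NEW.vote WHERE id = NEW.item_id;
-- the top 10 is then a simple index scan
SELECT id, score FROM items ORDER BY score DESC LIMIT 10;
If votes can be changed or deleted, corresponding AFTER UPDATE and AFTER DELETE triggers are needed as well.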

First, you don't need the subquery, so you can rewrite your query as:
SELECT `item_id`, SUM(vote) AS `sum`
FROM `votes`
GROUP BY `item_id`
ORDER BY `sum` DESC
LIMIT 10
Second, you can build an index on votes(item_id, vote). The group by will then be an index scan. This will take time as the table gets bigger, but it should be manageable for reasonable data sizes.
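For instance (a sketch; the index name is arbitrary):
ALTER TABLE `votes` ADD INDEX `item_vote` (`item_id`, `vote`);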
Finally, with this structure of a query, you need to do a filesort for the final order by. Whether this is efficient depends on the number of items you have, because the sort is over one row per item, not per vote. If each item has, on average, only one or two votes, the sort may take some time. If you have a fixed set of items and there are only a few hundred or a few thousand of them, then it should not be a performance bottleneck, even as the data size expands.
If this summary is really something you need quickly, then a trigger with a summary table (as explained in another answer) provides a faster retrieval method.

Related

Optimize MySQL query using range and group by on large table

My table structure:
CREATE TABLE `jobs_view_stats` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`job_id` int(11) NOT NULL,
`created_at` datetime NOT NULL,
`account_id` int(11) DEFAULT NULL,
`country` varchar(2) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `IDX_D05BC6799FDS15210` (`job_id`),
KEY `FK_YTGBC67994591257` (`account_id`),
KEY `jobs_view_stats_created_at_id_index` (`created_at`,`id`),
CONSTRAINT `FK_YTGBC67994591257` FOREIGN KEY (`account_id`) REFERENCES `accounts` (`id`) ON DELETE SET NULL,
CONSTRAINT `job_views_jobs_id_fk` FOREIGN KEY (`job_id`) REFERENCES `jobs` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=79976587 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci COMMENT='New jobs views system'
This is the query:
SELECT COUNT(id) as view, job_id
from jobs_view_stats
WHERE jobs_view_stats.created_at between '2022-11-01 00:00:00' AND '2022-11-30 23:59:59'
GROUP BY jobs_view_stats.job_id
Execution plan:
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE jobs_view_stats null range IDX_D05BC6799FDS15210,jobs_view_stats_created_at_id_index jobs_view_stats_created_at_id_index 5 null 1584610 100 Using index condition; Using MRR; Using temporary; Using filesort
This query takes 4 minutes to complete; I want to reduce that as much as possible.
In your execution plan, the range scan returns 1,584,610 rows that are then grouped, which in turn uses a temp table to sort and group (slow).
jobs_view_stats_created_at_id_index also contains id, which makes the key cardinality excessively large; you could instead try adding job_id to the key, as that is what you are grouping by.
I think the main issue is that your WHERE clause returns over 1.5 million rows, which all have to be loaded into a temp table and then re-read in full to be grouped.
You need to bite off smaller chunks, I think.
I'm going to assume you are using a programming language (like PHP) to invoke the DB calls; if so, you could try
SELECT DISTINCT job_id
from jobs_view_stats
WHERE jobs_view_stats.created_at between '2022-11-01 00:00:00' AND '2022-11-30 23:59:59'
then when you have the job_id list, do smaller queries looping over the results from the first:
SELECT count(*) FROM jobs_view_stats where job_id = *theid*
or if there are many different job_ids, batch them:
SELECT count(*) FROM jobs_view_stats where job_id IN (id1, id2, id3, ...)
For a pure MySQL resolution, I would create a temporary MEMORY table of all the job_ids using
CREATE TEMPORARY TABLE `temptable`
(
job_id INT PRIMARY KEY
) ENGINE=MEMORY
INSERT INTO `temptable` SELECT DISTINCT job_id
from jobs_view_stats
WHERE jobs_view_stats.created_at between '2022-11-01 00:00:00' AND '2022-11-30 23:59:59'
Then
SELECT count(*) as view, job_id FROM jobs_view_stats WHERE job_id IN (SELECT job_id FROM `temptable`) GROUP BY job_id
This is all totally untested, so there might be typos.
Change COUNT(id) to COUNT(*) since it does not need to check id for being NOT NULL
Add
INDEX(job_id, created_at)
INDEX(created_at, job_id)
Either of those will be "covering". (I don't know which is better; the Optimizer will know.)
Drop the index (job_id), as being redundant.
Unless you specifically need (created_at, id), drop it.
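Taken together, a sketch of those index changes (index names are illustrative and this is untested; the foreign key on job_id stays backed by the new (job_id, created_at) index):
ALTER TABLE jobs_view_stats
ADD INDEX job_created (job_id, created_at),
ADD INDEX created_job (created_at, job_id),
DROP INDEX IDX_D05BC6799FDS15210, -- the lone (job_id) index, now redundant
DROP INDEX jobs_view_stats_created_at_id_index; -- only if (created_at, id) is not needed elsewhere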
For even more performance, build and maintain a summary table. Then perform the query against it. Such a summary table might have PRIMARY KEY(job_id, created_date) and daily counts for each day for each job.
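A minimal sketch of such a summary table and its maintenance (table and column names are illustrative, untested):
CREATE TABLE job_views_daily (
job_id INT NOT NULL,
created_date DATE NOT NULL,
views INT UNSIGNED NOT NULL,
PRIMARY KEY (job_id, created_date)
) ENGINE=InnoDB;
-- roll up one day's rows, e.g. from a nightly job
INSERT INTO job_views_daily (job_id, created_date, views)
SELECT job_id, DATE(created_at), COUNT(*)
FROM jobs_view_stats
WHERE created_at >= '2022-11-01' AND created_at < '2022-11-02'
GROUP BY job_id, DATE(created_at)
ON DUPLICATE KEY UPDATE views = views + VALUES(views);
-- the original month query then scans at most one row per job per day
SELECT SUM(views) AS view, job_id
FROM job_views_daily
WHERE created_date BETWEEN '2022-11-01' AND '2022-11-30'
GROUP BY job_id;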

Real-time aggregation on a table with millions of records

I'm dealing with an ever-growing table which contains about 5 million records at the moment. About 100,000 new records are added daily.
The table contains information about ad campaigns, and is joined on query with another table:
CREATE TABLE `statistics` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`ip_range_id` int(11) DEFAULT NULL,
`campaign_id` int(11) DEFAULT NULL,
`payout` decimal(5,2) DEFAULT NULL,
`is_converted` tinyint(1) unsigned NOT NULL DEFAULT '0',
`converted` datetime DEFAULT NULL,
`created` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `created` (`created`),
KEY `converted` (`converted`),
KEY `campaign_id` (`campaign_id`),
KEY `ip_range_id` (`ip_range_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The other table contains IP ranges:
CREATE TABLE `ip_ranges` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`ip_range` varchar(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `ip_range` (`ip_range`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The aggregation query is as follows:
SELECT
SUM(`payout`) AS `revenue`,
(SELECT COUNT(*) FROM `statistics` WHERE `ip_range_id` = `IpRange`.`id`) AS `clicks`,
(SELECT COUNT(*) FROM `statistics` WHERE `ip_range_id` = `IpRange`.`id` AND `is_converted` = 1) AS `conversions`
FROM `ip_ranges` AS `IpRange`
INNER JOIN `statistics` AS `Statistic` ON `IpRange`.`id` = `Statistic`.`ip_range_id`
GROUP BY `IpRange`.`id`
ORDER BY `clicks` DESC
LIMIT 20
The query takes about 20 seconds to complete.
This is what EXPLAIN returns:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY ip_range index PRIMARY PRIMARY 4 NULL 306552 Using index; Using temporary; Using filesort
1 PRIMARY statistic ref ip_range_id ip_range_id 5 db.ip_range.id 8 Using where
3 DEPENDENT SUBQUERY statistics ref ip_range_id ip_range_id 5 func 8 Using where
2 DEPENDENT SUBQUERY statistics ref ip_range_id ip_range_id 5 func 8 Using where; Using index
Caching the clicks and conversions in the ip_ranges table as extra columns is not an option, because I need to be able to also filter on the campaign_id column (and possibly other columns in the future). So these aggregations need to be somewhat real-time.
What is the best strategy to do aggregation on large tables on multiple dimensions and near real-time?
Note that I'm not necessarily looking to just make the query better, but I'm also interested in strategies that might involve other database systems (NoSQL) and/or distributing the data over different servers, etc.
Your query looks overly complicated. There is no need to query the same table again and again:
select
sum(payout) as revenue,
count(*) as clicks,
sum(s.is_converted = 1) as conversions
from ip_ranges r
inner join statistics s on r.id = s.ip_range_id
group by r.id
order by clicks desc
limit 20;
(MySQL evaluates a boolean expression such as s.is_converted = 1 to 1 or 0, so summing it counts the matching rows.)
EDIT (after acceptance): As to your actual question on how to deal with a task like this:
You want to look at all the data in your table and you want your result to be up-to-date. Then there is no option other than to read all the data (full table scans). If the tables are wide (i.e. have many columns), you may want to create covering indexes (i.e. indexes that contain all columns involved), so that the index is read instead of the table. What else? For full table scans it is advisable to use parallel access, which MySQL doesn't provide, as far as I know. So you might want to switch to another DBMS and see what else it offers; the parallel querying might benefit from partitioning the tables. The last thing that comes to mind is hardware, i.e. more CPUs, faster drives, etc.
Another option might be to remove old data from your tables. Say you need the details of the current year, but only the aggregated data for previous years. Then have another table old_statistics holding only the sums and counts needed, e.g.
CREATE TABLE old_statistics
(
ip_range_id INT NOT NULL,
revenue DECIMAL(12,2) NOT NULL,
conversions INT UNSIGNED NOT NULL,
PRIMARY KEY (ip_range_id)
);
Then you'd aggregate the data from statistics, which would be much smaller by then because it would only hold data of the current year, and add old_statistics to get the full results.
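A sketch of combining the two (untested; if you also need clicks for the old data, old_statistics would need a clicks column as well):
SELECT ip_range_id, SUM(revenue) AS revenue, SUM(conversions) AS conversions
FROM (
SELECT ip_range_id, SUM(payout) AS revenue, SUM(is_converted) AS conversions
FROM statistics
GROUP BY ip_range_id
UNION ALL
SELECT ip_range_id, revenue, conversions
FROM old_statistics
) AS combined
GROUP BY ip_range_id;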
Try this
SELECT
SUM(`payout`) AS `revenue`,
SUM(case when `ip_range_id` = `IpRange`.`id` then 1 else 0 end) AS `clicks`,
SUM(case when `ip_range_id` = `IpRange`.`id` and `is_converted` = 1 then 1 else 0 end)
AS `conversions`
FROM `ip_ranges` AS `IpRange`
INNER JOIN `statistics` AS `Statistic` ON `IpRange`.`id` = `Statistic`.`ip_range_id`
GROUP BY `IpRange`.`id`
ORDER BY `clicks` DESC
LIMIT 20

At what execution level will MySQL utilize the index for ORDER BY?

I would like to understand at what point MySQL will use the indexed column when using ORDER BY.
For example, the query
SELECT * FROM A
INNER JOIN B ON B.id = A.id
WHERE A.status = 1 AND A.name = 'Mike' AND A.created_on BETWEEN '2014-10-01 00:00:00' AND NOW()
ORDER BY A.accessed_on DESC
Based on my knowledge, a good index for the above query is an index on table A (id, status, name, created_on, accessed_on) and another on B.id.
I also understand that SQL execution follows the order below, but I am not sure how the index selection and ordering work.
FROM clause
WHERE clause
GROUP BY clause
HAVING clause
SELECT clause
ORDER BY clause
Question
Will it be better to start the index with the id column, or does it not matter in this case since the WHERE is executed before the JOIN?
Second question: should the column accessed_on be at the beginning, the middle, or the end of the index combination? Or should the id column come after all the columns in the WHERE clause?
I would appreciate a detailed answer so I can understand the execution order of MySQL/SQL.
UPDATED
I added a few million records to both tables A and B, then added multiple indexes to see which would be the best. MySQL seems to like the index id_2 (i.e. (status, name, created_on, id, accessed_on)).
It seems to apply the WHERE first, figuring out that it needs an index on status, name, created_on; then it applies the INNER JOIN and uses the id column following the first three. Finally, it looks for accessed_on as the last column. So the index (status, name, created_on, id, accessed_on) fits the same execution order.
Here are the table structures:
CREATE TABLE `a` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`status` int(2) NOT NULL,
`name` varchar(255) NOT NULL,
`created_on` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`accessed_on` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `status` (`status`,`name`),
KEY `status_2` (`status`,`name`,`created_on`),
KEY `status_3` (`status`,`name`,`created_on`,`accessed_on`),
KEY `status_4` (`status`,`name`,`accessed_on`),
KEY `id` (`id`,`status`,`name`,`created_on`,`accessed_on`),
KEY `id_2` (`status`,`name`,`created_on`,`id`,`accessed_on`)
) ENGINE=InnoDB AUTO_INCREMENT=3135750 DEFAULT CHARSET=utf8
CREATE TABLE `b` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(255) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3012644 DEFAULT CHARSET=utf8
The best indexes for this query are A(status, name, created_on) and B(id). These indexes will satisfy the where clause and provide the index for the join to B.
This index will not be used for sorting. There are two major impediments to using any index for sorting. The first is the join. The second is the non-equality on created_on. Some databases might figure out to use an index on A(status, name, accessed_on), but I don't think MySQL is smart enough for that.
You don't want id as the first column in the index. This precludes using the index to filter on A, because id is used for the join rather than in the where.

MySQL Query Optimization for GPS Tracking system

I have the following query:
SELECT * FROM `alltrackers`
WHERE `deviceid`='FT_99000083401624'
AND `locprovider`!='none'
ORDER BY `id` DESC
This is the show create table:
CREATE TABLE `alltrackers` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`deviceid` varchar(50) NOT NULL,
`gpsdatetime` int(11) NOT NULL,
`locprovider` varchar(30) NOT NULL,
PRIMARY KEY (`id`),
KEY `gpsdatetime` (`gpsdatetime`),
KEY `locprovider` (`locprovider`),
KEY `deviceid` (`deviceid`(18))
) ENGINE=MyISAM AUTO_INCREMENT=8665045 DEFAULT CHARSET=utf8;
I've removed the columns which I thought were unnecessary for this question.
This is the EXPLAIN output for this query:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE alltrackers ref locprovider,deviceid deviceid 56 const 156416 Using where; Using filesort
This particular query is showing as taking several seconds in mytop (mtop). I'm a bit confused though, as the same query but with a different "deviceid" doesn't take as long. Although I only need the last row, I've already removed LIMIT 1 as that makes it take even longer. This table currently contains 3 million rows.
It is used for storing the locations from different GPS devices. Each GPS device has a unique device ID. Locations come in and are added to the table. For statistics I'm running the above query to find the time of the last received location from a certain device.
I'm open to advice on ways to further optimize the query or even the tables.
Many thanks in advance.
If you only need the last row, add an index on (deviceid, id, locprovider). It would be even faster with an index on (deviceid, id, locprovider, gpsdatetime):
ALTER TABLE alltrackers
ADD INDEX special_covering_IDX
(deviceid, id, locprovider, gpsdatetime) ;
Then try this out:
SELECT id, locprovider, gpsdatetime
FROM alltrackers
WHERE deviceid = 'FT_99000083401624'
AND locprovider <> 'none'
ORDER BY id DESC
LIMIT 1 ;

MySQL query needs optimization

I got this query:
SELECT user_id
FROM basic_info
WHERE age BETWEEN 18 AND 22 AND gender = 0
ORDER BY rating
LIMIT 50
The table looks like (and it contains about 700k rows):
CREATE TABLE IF NOT EXISTS `basic_info` (
`user_id` mediumint(8) unsigned NOT NULL auto_increment,
`gender` tinyint(1) unsigned NOT NULL default '0',
`age` tinyint(2) unsigned NOT NULL default '0',
`rating` smallint(5) unsigned NOT NULL default '0',
PRIMARY KEY (`user_id`),
KEY `tmp` (`gender`,`rating`)
) ENGINE=MyISAM;
The query itself is optimized, but it has to walk about 200k rows to do its job.
Here's the explain output:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE basic_info ref tmp,age tmp 1 const 200451 Using where
Is it possible to optimize the query so it won't walk over 200k rows?
Thanks!
There are two useful indexes that can help this query:
KEY gender_age (gender, age) -- this index can satisfy both the gender=0 condition and age BETWEEN 18 AND 22. However, because you have a range condition over the age field, adding the rating column to the index will not give sorted results -- hence MySQL will select all matching rows, ignoring your LIMIT clause, and do an additional filesort regardless.
KEY gender_rating (gender, rating) -- the index you already have; this index can satisfy the gender=0 condition and retrieves data already sorted by rating. However, the database has to scan all elements with gender=0 and eliminate those who are not in range age BETWEEN 18 AND 22
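For example, the first option would be created as (a sketch; the index name is arbitrary):
ALTER TABLE basic_info ADD INDEX gender_age (gender, age);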
Changing schema
If the above does not help you enough, changing your schema is always possible. One such approach is turning the age BETWEEN condition into an equality condition, by defining an age group column; for instance, ages 0-12 will be in age group 1, ages 12-18 in age group 2, etc.
This way, having an index on (gender, agegroup, rating) and a query with WHERE gender=0 AND agegroup=3 ORDER BY rating will retrieve the results from the index, already sorted. In this case, the LIMIT clause will fetch only 50 entries from the table and no more.
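A sketch of that schema change (untested; the age-group boundaries and names are illustrative):
ALTER TABLE basic_info
ADD COLUMN agegroup TINYINT UNSIGNED NOT NULL DEFAULT 0,
ADD INDEX gender_agegroup_rating (gender, agegroup, rating);
-- backfill, e.g. age group 3 covers ages 18 to 22
UPDATE basic_info SET agegroup = 3 WHERE age BETWEEN 18 AND 22;
-- the range condition becomes an equality, so the index returns rows already sorted
SELECT user_id
FROM basic_info
WHERE gender = 0 AND agegroup = 3
ORDER BY rating
LIMIT 50;
New and updated rows would need agegroup kept in sync by the application or a trigger.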
Extend your tmp key to include the age column:
KEY `tmp` (`age`,`gender`,`rating`)
Attempt to use InnoDB to improve performance?