Optimize COUNT(*) - MySQL

I have a table items from which I'm selecting 40 rows at a time ordered by the popularity of the item.
The popularity score is simply downloads/impressions.
Query:
SELECT id, name
FROM items
ORDER BY (SELECT COUNT(*) FROM downloads WHERE item = items.id)/
(SELECT COUNT(*) FROM impressions WHERE item = items.id)
LIMIT 40;
The problem is that the query takes forever to complete (ranging from 2 to 10 seconds).
At the moment we have 25K items, 18M impressions, and 560K downloads.
We already tried adding the fields downloads and impressions in the table items and keeping the count updated using triggers (after an insert in the tables impressions and downloads we increment the values), but we've had some issues with deadlocking.
Is there a better way to optimize this query?
Thanks.
Edit
Here's the output of EXPLAIN
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY items ALL NULL NULL NULL NULL 20496 Using filesort
3 DEPENDENT SUBQUERY impressions ref PRIMARY PRIMARY 4 db.items.id 74 Using index
2 DEPENDENT SUBQUERY downloads ref PRIMARY PRIMARY 4 db.items.id 274 Using index
Tables:
CREATE TABLE `items` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(35) DEFAULT '',
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=24369 DEFAULT CHARSET=utf8mb4;
CREATE TABLE `impressions` (
`item` int(10) unsigned NOT NULL,
`user` char(36) NOT NULL DEFAULT '',
PRIMARY KEY (`item`,`user`),
CONSTRAINT `impression_ibfk_1` FOREIGN KEY (`item`) REFERENCES `items` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
CREATE TABLE `downloads` (
`item` int(10) unsigned NOT NULL,
`user` char(36) NOT NULL DEFAULT '',
PRIMARY KEY (`item`,`user`),
CONSTRAINT `download_ibfk_1` FOREIGN KEY (`item`) REFERENCES `items` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

I think the following query can solve your problem:
SELECT
item, items.name, downloads.cnt/impressions.cnt AS rate
FROM (
SELECT item, COUNT(*) AS cnt FROM downloads GROUP BY item
) AS downloads
JOIN (
SELECT item, COUNT(*) AS cnt FROM impressions GROUP BY item
) AS impressions USING (item)
JOIN items ON items.id = downloads.item
ORDER BY rate DESC
LIMIT 40;
Also make sure the downloads and impressions tables are indexed on the item column (here the composite primary keys starting with item already provide that).

Not solvable with that approach.
There are two solutions:
Keep counters (by item.id) for impressions and downloads.
Summary tables.
Counters This involves adding an extra column for each counter to the items table, or building a parallel table with id and the various counters. For a really high volume of counts, the latter avoids some clashes between the various queries.
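For the parallel-table flavor, a minimal sketch (the table and column names are mine, not from the original post); the single-statement upsert avoids the SELECT-then-UPDATE pattern that invites deadlocks:
CREATE TABLE item_counters (
  item        int unsigned NOT NULL PRIMARY KEY,
  downloads   int unsigned NOT NULL DEFAULT 0,
  impressions int unsigned NOT NULL DEFAULT 0
) ENGINE=InnoDB;

-- In place of a trigger: bump the counter with a single-row,
-- index-based upsert in the same transaction that records the event.
INSERT INTO item_counters (item, impressions)
VALUES (123, 1)
ON DUPLICATE KEY UPDATE impressions = impressions + 1;

-- The top-40 query then touches only 25K counter rows:
SELECT i.id, i.name
FROM items AS i
JOIN item_counters AS c ON c.item = i.id
ORDER BY c.downloads / c.impressions DESC
LIMIT 40;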
Summary Tables Build and incrementally augment a table (or tables) that summarizes counts like these, plus perhaps other SUMs, COUNTs, etc. The table would perhaps be augmented daily with the previous day's information. Then "sum the counts" to get the grand total; this will be much faster than your current query.
More on Summary Tables: http://mysql.rjweb.org/doc.php/summarytables
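A sketch of a daily summary table for downloads (the impressions table would get a twin). The schemas above have no timestamp column, so the created_at column here is an assumption; without some notion of time you would rebuild rather than augment:
CREATE TABLE daily_downloads (
  dy   DATE NOT NULL,
  item int unsigned NOT NULL,
  cnt  int unsigned NOT NULL,
  PRIMARY KEY (dy, item)
) ENGINE=InnoDB;

-- Run once a day, summarizing yesterday (assumes downloads.created_at exists):
INSERT INTO daily_downloads (dy, item, cnt)
SELECT CURDATE() - INTERVAL 1 DAY, item, COUNT(*)
FROM downloads
WHERE created_at >= CURDATE() - INTERVAL 1 DAY
  AND created_at <  CURDATE()
GROUP BY item;

-- "Sum the counts" for the grand total per item:
SELECT item, SUM(cnt) AS downloads FROM daily_downloads GROUP BY item;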

I'd count downloads and impressions first and then get the top 40:
with d as (select item, count(*) as total from downloads group by item)
, i as (select item, count(*) as total from impressions group by item)
, top40 as (select item from d join i using (item) order by d.total / i.total limit 40)
select *
from items
where id in
(
select item from top40
);
The WITH clause is available as of MySQL 8. In earlier versions, you'd work with subqueries instead.
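For instance, a pre-8.0 rewrite of the same idea using derived tables might look like this (a sketch, untested; the extra derived level works around MySQL's "LIMIT & IN subquery" restriction):
SELECT *
FROM items
WHERE id IN
(
  SELECT item
  FROM (
    SELECT d.item
    FROM (SELECT item, COUNT(*) AS total FROM downloads   GROUP BY item) AS d
    JOIN (SELECT item, COUNT(*) AS total FROM impressions GROUP BY item) AS i USING (item)
    ORDER BY d.total / i.total
    LIMIT 40
  ) AS top40
);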
As item is a foreign key in downloads and impressions, and id is the primary key in items, I suppose there are indexes on them. Otherwise, create them:
create unique index idx1 on items(id);
create index idx2 on downloads(item);
create index idx3 on impressions(item);

Sum of averages raw query

I have the following code that I have to optimize:
These are the models:
class Question(models.Model):
    question_id = models.CharField(max_length=20)
    label = models.CharField(max_length=255, verbose_name='Question')

class Property(models.Model):
    name = models.CharField(max_length=200)

class Response(models.Model):
    question = models.ForeignKey(Question, on_delete=models.CASCADE)
    submit_date = models.DateTimeField()
    score = models.IntegerField(null=True, blank=True)
    is_null = models.BooleanField(default=False)
    ignore = models.BooleanField(default=False)
    property = models.ForeignKey(Property, on_delete=models.CASCADE)

class Plan(models.Model):
    name = models.CharField(max_length=100)
    questions = models.ManyToManyField(Question, through='PlanQuestion')
    start_date = models.DateField(null=True)
    completion_date = models.DateField(null=True)

class PlanQuestion(models.Model):
    question = models.ForeignKey(Question, on_delete=models.CASCADE)
    plan = models.ForeignKey(Plan, on_delete=models.CASCADE)
I first iterate over the plans, then over the plan questions, like this:
plans = Plan.objects.filter(
    start_date__isnull=False, completion_date__isnull=False
)
for plan in plans:
    plan_questions = plan.questions.through.objects.filter(plan=plan)
    for plan_question in plan_questions:
        # run the below query for each plan_question here
In the above code, for each plan question, this query is run to calculate the average score:
SELECT AVG(score) AS average_score
FROM Response WHERE question_id=%(question_id)s
AND DATE(submit_date) >= %(start_date)s AND DATE(submit_date) <= %(end_date)s
The problem is this: let us say Plan1 has 5 questions:
P1 => Avg(Q1) + Avg(Q2) + Avg(Q3) + Avg(Q4) + Avg(Q5)
The query is run for each question, calculating the average score over that question's responses (one question can have many responses). So for P1, 5 queries are run; if one query takes 0.5 seconds, that's 2.5 seconds (5 * 0.5) for one plan. As we add plans, each with 5 questions, the total time grows linearly with the number of plans.
I want to reduce the number of these queries so that I don't have to run a separate query for each question. How can I combine all the per-question queries into one query? Maybe I could use a UNION, but I don't see how to write a single query that way; there might also be a better solution than a UNION.
I also tried adding prefetch_related, but that brought no improvement.
Edit:
Create Tables:
CREATE TABLE `Response` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`question_id` int(11) NOT NULL,
`score` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `Response_25110688` (`question_id`),
CONSTRAINT `question_id_refs_id_2dd82bdb` FOREIGN KEY (`question_id`) REFERENCES `Question` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=157533450 DEFAULT CHARSET=latin1
CREATE TABLE `Question` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`question_id` varchar(20) NOT NULL,
`label` varchar(255) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=353 DEFAULT CHARSET=latin1
CREATE TABLE `Plan` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`start_date` date DEFAULT NULL,
`completion_date` date DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=687 DEFAULT CHARSET=latin1
CREATE TABLE `PlanQuestion` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`plan_id` int(11) NOT NULL,
`question_id` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `PlanQuestion_plan_id_de8df699_fk_Plan_id` (`plan_id`),
KEY `PlanQuestion_question_id_49c10d5b_fk_Question_id` (`question_id`),
CONSTRAINT `PlanQuestion_plan_id_de8df699_fk_Plan_id` FOREIGN KEY (`plan_id`) REFERENCES `Plan` (`id`),
CONSTRAINT `PlanQuestion_question_id_49c10d5b_fk_Question_id` FOREIGN KEY (`question_id`) REFERENCES `Question` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2130 DEFAULT CHARSET=latin1
CREATE TABLE `Property` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(200) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=188651 DEFAULT CHARSET=latin1
Here is the full query:
SELECT id, COUNT(*) AS count, AVG(int_val) AS int_average
FROM Response WHERE question_id=%(question_id)s
AND property_id=%(property_id)s and is_null=0
AND Response.ignore=0 AND DATE(submit_date) >= %(start_date)s
AND DATE(submit_date) <= %(end_date)s
This does not make a lot of sense:
SELECT id, COUNT(*) AS count, AVG(int_val) AS int_average
FROM Response
WHERE question_id=%(question_id)s
AND DATE(submit_date) >= %(stard_date)s
AND DATE(submit_date) <= %(end_date)s
Without a GROUP BY, the COUNT and AVG will be totals for the one "question_id". But then if there is a different id for each row, which id are you hoping for?
OK, assuming id is removed, it needs this composite index with the columns in this order:
INDEX(question_id, submit_date)
Meanwhile, remove INDEX(question_id) because it will be in the way.
Sorry, but sometimes performance requires changes.
Secondly... "for plan_question in plan_questions" implies that you want that to be run for every "question"?
Then get rid of the loop and do all the work at the same time:
SELECT question_id, COUNT(*) AS count, AVG(int_val) AS int_average
FROM Response
WHERE DATE(submit_date) >= %(start_date)s
AND DATE(submit_date) <= %(end_date)s
GROUP BY question_id
This will return one row per question; then you can loop through the resultset to deliver the output.
Good news: Even if you don't add the above index, this will work better than what you have now.
Also... cur_date = datetime.now().date() could be removed from the app code; instead, use simply CURDATE() in SQL to get just the date or NOW() to get the date+time.
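For example, a rolling 30-day window computed entirely in SQL (the 30-day span is purely illustrative, not from the original post):
SELECT question_id, COUNT(*) AS count, AVG(int_val) AS int_average
FROM Response
WHERE submit_date >= CURDATE() - INTERVAL 30 DAY
GROUP BY question_id;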
Indexing Getting rid of "for plan_question in plan_questions" will be the biggest benefit. The query (as I wrote it) already benefits from the index on question_id. However, adding INDEX(submit_date) might run faster if the date range is narrow.
If there are other clauses in the WHERE, we need to see them. There may be other indexes to suggest.
More
SELECT id, COUNT(*) AS count
FROM response
-- (and not JOINing to any other tables)
GROUP BY id;
This query always has a count of 1 because each id occurs in response exactly once.
SELECT
-- (without id)
COUNT(*) AS count
FROM response
-- (and not JOINing to any other tables)
-- (without GROUP BY)
;
This query always returns exactly 1 row.
Still More
Based on
WHERE question_id=%(question_id)s
AND property_id=%(property_id)s and is_null=0
AND Response.ignore=0 AND DATE(submit_date)...
you need
INDEX(question_id, property_id, is_null, ignore)
and drop INDEX(question_id).
But... My statement about doing a single query instead of an app loop still stands.
JOINing to Plan
SELECT r.question_id,
COUNT(*) AS count,
AVG(r.int_val) AS int_average,
p.plan -- perhaps you want to say which "plan" is involved?
FROM Plans AS p
JOIN PlanQuestions AS pq ON pq.plan_id = p.plan_id
JOIN Responses AS r ON r.question_id = pq.question_id
WHERE p.... -- optionally filter on which plans to include
AND pq.... -- optionally filter on the other columns in pq
AND r.... -- optionally filter on which responses to include
ORDER BY ... -- optionally sort the results by any column(s) in any table(s)
And remove the two single-column indexes in PlanQuestion, replacing them with two 2-column indexes:
INDEX(plan_id, question_id),
INDEX(question_id, plan_id)
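One way to do that in a single statement (a sketch; the dropped names come from the CREATE TABLE above, the added names are mine). It works because the new composite indexes still give the foreign keys the leftmost index they need on plan_id and question_id:
ALTER TABLE PlanQuestion
  ADD INDEX idx_plan_question (plan_id, question_id),
  ADD INDEX idx_question_plan (question_id, plan_id),
  DROP INDEX `PlanQuestion_plan_id_de8df699_fk_Plan_id`,
  DROP INDEX `PlanQuestion_question_id_49c10d5b_fk_Question_id`;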
Sargable
DATE(submit_date) >= "..." is "not sargable". This means that an index involving submit_date cannot help with the test. Rewrite the test so the column stands alone; since submit_date is a DATETIME, the equivalent sargable range is:
submit_date >= %(start_date)s AND submit_date < %(end_date)s + INTERVAL 1 DAY

How to optimize query for Max(Date) in MySQL

I have this SQL Query:
SELECT company.*, salesorder.lastOrderDate
FROM company
INNER JOIN
(
SELECT companyId, MAX(orderDate) AS lastOrderDate
FROM salesorder
GROUP BY companyId
) salesorder ON salesorder.companyId = company.companyId;
This gives me one extra column at the end of a company master table with their last order date.
The problem is, when analyzing this query, it doesn't seem that efficient (the analyzer output was posted as a screenshot).
Is there a way to make this more efficient?
salesorder:
orderId, companyId, orderDate
1 333 2015-01-01
2 555 2016-01-01
3 333 2017-01-01
company
companyId, name
333 Acme
555 Microsoft
Query:
companyId, name, lastOrderDate
333 Acme 2017-01-01
555 Microsoft 2016-01-01
EXPLAIN SELECT: [screenshot not reproduced]
Table definitions:
CREATE TABLE `salesorder` (
`orderId` int(11) NOT NULL,
`companyId` int(11) DEFAULT NULL,
`orderDate` date DEFAULT NULL,
PRIMARY KEY (`orderId`),
UNIQUE KEY `orderId_UNIQUE` (`orderId`) /*!80000 INVISIBLE */,
KEY `testComposite` (`companyId`,`orderDate`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
CREATE TABLE `company` (
`companyId` int(11) NOT NULL,
`name` varchar(45) DEFAULT NULL,
PRIMARY KEY (`companyId`),
UNIQUE KEY `companyId_UNIQUE` (`companyId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
It looks like you could simplify the query like this:
SELECT c.*, MAX(o.OrderDate) As lastOrderDate
FROM company c
INNER JOIN salesorder o on o.companyId = c.companyId
GROUP BY <list all company fields here>;
MySQL will even let you get away with just c.companyId in the GROUP BY clause; since companyId is the primary key, the other company columns are functionally dependent on it, which standard SQL (and MySQL 5.7+ with ONLY_FULL_GROUP_BY) accepts.
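That shorter form would look like this (a sketch, untested):
SELECT c.*, MAX(o.orderDate) AS lastOrderDate
FROM company c
INNER JOIN salesorder o ON o.companyId = c.companyId
GROUP BY c.companyId;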
Add the composite index with the columns in this order:
INDEX(companyId, orderDate)
Single column indexes are not as efficient (in this query).
Since a PRIMARY KEY is a unique key, do not redundantly declare a UNIQUE key.
With only a few rows in the table, you cannot trust EXPLAIN (and Explain-like output) to say how bad the query will be. Try it with at least a few dozen rows. And provide EXPLAIN FORMAT=JSON SELECT ...
Note that it says "Using index". That says that the subquery in question can be performed entirely inside the index's BTree. This is 'good'. (I presume you did the EXPLAIN after adding my suggested index?)
Your previous image showed a lot of rows; what gives?
I'm still puzzled as to why there are 3 rows in the EXPLAIN and two table scans. Anyway, here is another formulation to try:
SELECT c.*,
( SELECT MAX(orderDate)
FROM salesorder
WHERE companyId = c.companyId
) AS lastOrderDate
FROM company AS c;
(and my INDEX is still important)

MySQL query not optimized and very slow, but why?

In the software that I develop, a car dealer application, there's a section with an agenda holding all the appointments of the users.
This section is pretty fast to load under normal daily use of the agenda (thousands of rows), but it starts to get really slow when the agenda tables reach a million rows.
The structure:
1) Main table
CREATE TABLE IF NOT EXISTS `agenda` (
`id_agenda` int(11) NOT NULL AUTO_INCREMENT,
`id_user` int(11) NOT NULL DEFAULT '0',
`id_agency` int(11) NOT NULL DEFAULT '0',
`id_customer` int(11) DEFAULT NULL,
`id_car` int(11) DEFAULT NULL,
`id_owner` int(11) DEFAULT NULL,
`type` int(11) NOT NULL DEFAULT '8',
`title` varchar(255) NOT NULL DEFAULT '',
`text` text NOT NULL,
`start_day` date NOT NULL DEFAULT '0000-00-00',
`end_day` date NOT NULL DEFAULT '0000-00-00',
`start_hour` time NOT NULL DEFAULT '00:00:00',
`end_hour` time NOT NULL DEFAULT '00:00:00',
PRIMARY KEY (`id_agenda`),
KEY `start_day` (`start_day`),
KEY `id_customer` (`id_customer`),
KEY `id_car` (`id_car`),
KEY `id_user` (`id_user`),
KEY `id_owner` (`id_owner`),
KEY `type` (`type`),
KEY `id_agency` (`id_agency`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 ;
2) Secondary table
CREATE TABLE IF NOT EXISTS `agenda_cars` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`id_agenda` int(11) NOT NULL,
`id_car` int(11) NOT NULL,
`id_owner` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `id_agenda` (`id_agenda`),
KEY `id_car` (`id_car`),
KEY `id_owner` (`id_owner`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
Query:
SELECT a.id_agenda
FROM agenda as a
LEFT JOIN agenda_cars as agc on agc.id_agenda = a.id_agenda
WHERE
(a.id_customer = '22' OR (a.id_owner = '22' OR agc.id_owner = '22' ))
GROUP BY a.id_agenda
ORDER BY a.start_day, a.start_hour
Explain:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE a index PRIMARY PRIMARY 4 NULL 1051987 Using temporary; Using filesort
1 SIMPLE agc ref id_agenda id_agenda 4 db.a.id_agenda 1 Using where
The query takes up to 10 seconds to complete with id 22, and with other ids it can reach 20 seconds; that's just for the query, and loading everything in the web page of course takes more time.
I don't see why it takes so long to get the data: I think the indexes are configured correctly and the query is pretty simple, so why?
Too much data?
I've solved it in this way:
SELECT a.id_agenda
FROM
(
SELECT id_agenda
FROM agenda
WHERE (id_customer = '22' OR id_owner = '22' )
UNION
SELECT id_agenda
FROM agenda_cars
WHERE id_owner = '22'
) as at
INNER JOIN agenda as a on a.id_agenda = at.id_agenda
GROUP BY a.id_agenda
ORDER BY a.start_day, a.start_hour
This version of the query is ten times faster than the previous one... but why?
Thanks to everyone willing to help clear up my doubts!
UPDATE AFTER Rick James' solution:
Query suggested
SELECT a.id_agenda
FROM
(
SELECT id_agenda FROM agenda WHERE id_customer = '22'
UNION DISTINCT
SELECT id_agenda FROM agenda WHERE id_owner = '22'
UNION DISTINCT
SELECT id_agenda FROM agenda_cars WHERE id_owner = '22'
) as at
INNER JOIN agenda as a ON a.id_agenda = at.id_agenda
ORDER BY a.start_datetime;
Result: 279 total, 0.0111 sec
EXPLAIN:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 366 Using temporary; Using filesort
1 PRIMARY a eq_ref PRIMARY PRIMARY 4 at.id_agenda 1 NULL
2 DERIVED agenda ref id_customer id_customer 5 const 1 Using index
3 UNION agenda ref id_owner id_owner 5 const 114 Using index
4 UNION agenda_cars ref id_owner id_owner 4 const 250 NULL
NULL UNION RESULT <union2,3,4> ALL NULL NULL NULL NULL NULL Using temporary
Before I dig into what can be done, let me list several red flags I see.
OR is hard to optimize
Filtering (WHERE) on multiple tables JOINed together is hard to optimize.
GROUP BY x ORDER BY z means two passes over the data, usually 2 temp tables and filesorts.
Did you really mean LEFT? It says "the right table (agc) might be missing, in which case provide NULLs".
(You may not be able to get rid of all of the red flags.)
Red flags in the Schema:
Indexing every column -- usually not useful
Only single-column indexes -- "composite" indexes often help.
DATE and TIME as separate columns -- usually makes for clumsy queries.
OK, those are off my shoulder, now to study the query... (Oh, and thanks for providing the CREATEs and EXPLAIN!)
The ON implies a 1:many relationship between agenda:agenda_cars. Is that correct?
id_owner and id_car are in both tables, yet are not included in the ON; what's up?
(Here's the meat of the answer to your final question.) Why have GROUP BY? I see no aggregates. I will guess that the 1:many relationship led to multiple rows, and you needed to de-dup? For dedupping, please use DISTINCT. But the real solution is to avoid the "inflate (JOIN) - deflate (GROUP BY)" syndrome. Your subquery is a good start on that.
Rolling some of the above comments in, plus more:
SELECT a.id_agenda
FROM
(
SELECT id_agenda FROM agenda WHERE id_customer = '22'
UNION DISTINCT
SELECT id_agenda FROM agenda WHERE id_owner = '22'
UNION DISTINCT
SELECT id_agenda FROM agenda_cars WHERE id_owner = '22'
) as at
INNER JOIN agenda as a ON a.id_agenda = at.id_agenda
ORDER BY a.start_datetime;
Notes:
Got rid of the other OR
Explicit UNION DISTINCT to be clear that dups are expected.
Tossed the GROUP BY without adding SELECT DISTINCT; UNION DISTINCT deals with the need.
You have the 4 necessary indexes (one per subquery, plus the PK): (id_customer) and (id_owner) on agenda, (id_owner) on agenda_cars, and PRIMARY KEY(id_agenda).
The indexes are "covering" indexes for all the subqueries -- an extra bonus.
There will be one unavoidable tmp table and file sort -- for the ORDER BY, but it won't be on a million rows.
(No need for composite indexes -- this time.)
I changed to a DATETIME (see the sketch after these notes); change back if you have a good reason for splitting them.
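A sketch of that combination, assuming a new column named start_datetime (my name, matching the query above); TIMESTAMP(date_expr, time_expr) is MySQL's built-in way to glue a DATE and a TIME together:
ALTER TABLE agenda
  ADD COLUMN start_datetime DATETIME NOT NULL DEFAULT '0000-00-00 00:00:00';

-- Backfill from the existing split columns:
UPDATE agenda SET start_datetime = TIMESTAMP(start_day, start_hour);

-- Once the application writes start_datetime directly, the old
-- columns can be dropped:
-- ALTER TABLE agenda DROP COLUMN start_day, DROP COLUMN start_hour;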
Did I get you another 10x? Did I explain it sufficiently?
Oh, one more thing...
This query returns a list of ids ordered by something that it does not return (date+time). What will you do with the ids? If you use this as a subquery feeding another query, the Optimizer has a right to throw away the ORDER BY. Just warning you.

Why is my MySQL group by so slow?

I am trying to query against a partitioned table (by month) approaching 20M rows. I need to group by DATE(transaction_utc) as well as country_id. The rows that get returned if I turn off the group by and aggregates number just over 40k, which isn't too many; however, adding the group by makes the query substantially slower, unless said GROUP BY is on the transaction_utc column, in which case it gets FAST.
I've been trying to optimize the first query below by tweaking the query and/or the indexes, and got to the point below (about 2x as fast as initially); however, I'm still stuck with a 5s query for summarizing 45k rows, which seems way too much.
For reference, this box is a brand-new 24-logical-core, 64GB-RAM MariaDB 5.5.x server with far more InnoDB buffer pool available than index space, so there shouldn't be any RAM or CPU pressure.
So, I'm looking for ideas on what is causing this slow down and suggestions on speeding it up. Any feedback would be greatly appreciated! :)
Ok, onto the details...
The following query (the one I actually need) takes approx 5 seconds (+/-), and returns less than 100 rows.
SELECT lss.`country_id` AS CountryId
, Date(lss.`transaction_utc`) AS TransactionDate
, c.`name` AS CountryName, lss.`country_id` AS CountryId
, COALESCE(SUM(lss.`sale_usd`),0) AS SaleUSD
, COALESCE(SUM(lss.`commission_usd`),0) AS CommissionUSD
FROM `sales` lss
JOIN `countries` c ON lss.`country_id` = c.`country_id`
WHERE ( lss.`transaction_utc` BETWEEN '2012-09-26' AND '2012-10-26' AND lss.`username` = 'someuser' ) GROUP BY lss.`country_id`, DATE(lss.`transaction_utc`)
EXPLAIN SELECT for the same query is as follows. Notice that it's not using the transaction_utc key. Shouldn't it be using my covering index instead?
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE lss ref idx_unique,transaction_utc,country_id idx_unique 50 const 1208802 Using where; Using temporary; Using filesort
1 SIMPLE c eq_ref PRIMARY PRIMARY 4 georiot.lss.country_id 1
Now onto a couple of other options I've tried while attempting to determine what's going on...
The following query (changed group by) takes about 5 seconds (+/-), and returns only 3 rows:
SELECT lss.`country_id` AS CountryId
, DATE(lss.`transaction_utc`) AS TransactionDate
, c.`name` AS CountryName, lss.`country_id` AS CountryId
, COALESCE(SUM(lss.`sale_usd`),0) AS SaleUSD
, COALESCE(SUM(lss.`commission_usd`),0) AS CommissionUSD
FROM `sales` lss
JOIN `countries` c ON lss.`country_id` = c.`country_id`
WHERE ( lss.`transaction_utc` BETWEEN '2012-09-26' AND '2012-10-26' AND lss.`username` = 'someuser' ) GROUP BY lss.`country_id`
The following query (removed group by) takes 4-5 seconds (+/-) and returns 1 row:
SELECT lss.`country_id` AS CountryId
, DATE(lss.`transaction_utc`) AS TransactionDate
, c.`name` AS CountryName, lss.`country_id` AS CountryId
, COALESCE(SUM(lss.`sale_usd`),0) AS SaleUSD
, COALESCE(SUM(lss.`commission_usd`),0) AS CommissionUSD
FROM `sales` lss
JOIN `countries` c ON lss.`country_id` = c.`country_id`
WHERE ( lss.`transaction_utc` BETWEEN '2012-09-26' AND '2012-10-26' AND lss.`username` = 'someuser' )
The following query takes .00X seconds (+/-) and returns ~45k rows. This shows me that, at most, we're only trying to group 45K rows into fewer than 100 groups (as in my initial query):
SELECT lss.`country_id` AS CountryId
, DATE(lss.`transaction_utc`) AS TransactionDate
, c.`name` AS CountryName, lss.`country_id` AS CountryId
, COALESCE(SUM(lss.`sale_usd`),0) AS SaleUSD
, COALESCE(SUM(lss.`commission_usd`),0) AS CommissionUSD
FROM `sales` lss
JOIN `countries` c ON lss.`country_id` = c.`country_id`
WHERE ( lss.`transaction_utc` BETWEEN '2012-09-26' AND '2012-10-26' AND lss.`username` = 'someuser' )
GROUP BY lss.`transaction_utc`
TABLE SCHEMA:
CREATE TABLE IF NOT EXISTS `sales` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`user_linkshare_account_id` int(11) unsigned NOT NULL,
`username` varchar(16) NOT NULL,
`country_id` int(4) unsigned NOT NULL,
`order` varchar(16) NOT NULL,
`raw_tracking_code` varchar(255) DEFAULT NULL,
`transaction_utc` datetime NOT NULL,
`processed_utc` datetime NOT NULL ,
`sku` varchar(16) NOT NULL,
`sale_original` decimal(10,4) NOT NULL,
`sale_usd` decimal(10,4) NOT NULL,
`quantity` int(11) NOT NULL,
`commission_original` decimal(10,4) NOT NULL,
`commission_usd` decimal(10,4) NOT NULL,
`original_currency` char(3) NOT NULL,
PRIMARY KEY (`id`,`transaction_utc`),
UNIQUE KEY `idx_unique` (`username`,`order`,`processed_utc`,`sku`,`transaction_utc`),
KEY `raw_tracking_code` (`raw_tracking_code`),
KEY `idx_usd_amounts` (`sale_usd`,`commission_usd`),
KEY `idx_countries` (`country_id`),
KEY `transaction_utc` (`transaction_utc`,`username`,`country_id`,`sale_usd`,`commission_usd`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
/*!50100 PARTITION BY RANGE ( TO_DAYS(`transaction_utc`))
(PARTITION pOLD VALUES LESS THAN (735112) ENGINE = InnoDB,
PARTITION p201209 VALUES LESS THAN (735142) ENGINE = InnoDB,
PARTITION p201210 VALUES LESS THAN (735173) ENGINE = InnoDB,
PARTITION p201211 VALUES LESS THAN (735203) ENGINE = InnoDB,
PARTITION p201212 VALUES LESS THAN (735234) ENGINE = InnoDB,
PARTITION pMAX VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */ AUTO_INCREMENT=19696320 ;
The offending part is probably the GROUP BY DATE(transaction_utc). You also claim to have a covering index for this query but I see none. Your 5-column index has all the columns used in the query but not in the best order (which is: WHERE - GROUP BY - SELECT).
So, the engine, finding no useful index, would have to evaluate this function for all the 20M rows. Actually, it finds an index that starts with username (the idx_unique) and it uses that, so it has to evaluate the function for (only) 1.2M rows. If you had a (transaction_utc) or a (username, transaction_utc) it would choose the most useful of the three.
Can you afford to change the table structure by splitting the column into date and time parts?
If you can, then an index on (username, country_id, transaction_date) or, changing the order of the two columns used for grouping, on (username, transaction_date, country_id) would be quite efficient.
A covering index on (username, country_id, transaction_date, sale_usd, commission_usd) would be even better.
If you want to keep the current structure, try changing the order inside your 5-column index to:
(username, country_id, transaction_utc, sale_usd, commission_usd)
or to:
(username, transaction_utc, country_id, sale_usd, commission_usd)
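One way to apply that, reusing the existing index name (a sketch; rebuilding a five-column index over ~20M rows will take a while):
ALTER TABLE sales
  DROP INDEX transaction_utc,
  ADD INDEX transaction_utc (username, transaction_utc, country_id, sale_usd, commission_usd);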
Since you are using MariaDB, you can use the VIRTUAL columns feature, without changing the existing columns:
Add a virtual (persistent) column and the appropriate index:
ALTER TABLE sales
ADD COLUMN transaction_date DATE
AS (DATE(transaction_utc)) PERSISTENT,
ADD INDEX special_IDX
(username, country_id, transaction_date, sale_usd, commission_usd);
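The original query could then filter and group on the persistent column, letting special_IDX cover it entirely (a sketch of the rewrite):
SELECT lss.country_id AS CountryId
, lss.transaction_date AS TransactionDate
, c.name AS CountryName
, COALESCE(SUM(lss.sale_usd),0) AS SaleUSD
, COALESCE(SUM(lss.commission_usd),0) AS CommissionUSD
FROM sales lss
JOIN countries c ON lss.country_id = c.country_id
WHERE lss.username = 'someuser'
AND lss.transaction_date BETWEEN '2012-09-26' AND '2012-10-26'
GROUP BY lss.country_id, lss.transaction_date;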

MySQL JOIN time reduction

This query is taking over a minute to complete:
SELECT keyword, count(*) as 'Number of Occurences'
FROM movie_keyword
JOIN
keyword
ON keyword.`id` = movie_keyword.`keyword_id`
GROUP BY keyword
ORDER BY count(*) DESC
LIMIT 5
Every keyword has an ID associated with it (keyword_id column). And that ID is used to look up the actual keyword from the keyword table.
movie_keyword has 2.8 million rows
keyword has 127,000
However, returning just the most-used keyword_ids takes only 1 second:
SELECT keyword_id, count(*)
FROM movie_keyword
GROUP BY keyword_id
ORDER BY count(*) DESC
LIMIT 5
Is there a more efficient way of doing this?
Output with EXPLAIN:
1 SIMPLE keyword ALL PRIMARY NULL NULL NULL 125405 Using temporary; Using filesort
1 SIMPLE movie_keyword ref idx_keywordid idx_keywordid 4 imdb.keyword.id 28 Using index
Structure:
CREATE TABLE `movie_keyword` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`movie_id` int(11) NOT NULL,
`keyword_id` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `idx_mid` (`movie_id`),
KEY `idx_keywordid` (`keyword_id`),
KEY `keyword_ix` (`keyword_id`),
CONSTRAINT `movie_keyword_keyword_id_exists` FOREIGN KEY (`keyword_id`) REFERENCES `keyword` (`id`),
CONSTRAINT `movie_keyword_movie_id_exists` FOREIGN KEY (`movie_id`) REFERENCES `title` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=4256379 DEFAULT CHARSET=latin1;
CREATE TABLE `keyword` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`keyword` text NOT NULL,
`phonetic_code` varchar(5) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_keyword` (`keyword`(5)),
KEY `idx_pcode` (`phonetic_code`),
KEY `keyword_ix` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=127044 DEFAULT CHARSET=latin1;
Untested, but this should work and be significantly faster in my opinion. I'm not sure whether you're allowed to use LIMIT in a subquery in MySQL, though; there are ways around that if not.
SELECT keyword, count(*) as 'Number of Occurences'
FROM movie_keyword
JOIN
keyword
ON keyword.`id` = movie_keyword.`keyword_id`
WHERE movie_keyword.keyword_id IN (
SELECT keyword_id
FROM movie_keyword
GROUP BY keyword_id
ORDER BY count(*) DESC
LIMIT 5
)
GROUP BY keyword
ORDER BY count(*) DESC;
This should be faster because you don't join all 2.8 million entries in movie_keyword with keyword, just the ones that actually match, which I'm guessing are significantly fewer.
EDIT: since MySQL doesn't support LIMIT inside an IN subquery, you have to run
SELECT keyword_id
FROM movie_keyword
GROUP BY keyword_id
ORDER BY count(*) DESC
LIMIT 5;
first, and after fetching the results, run the second query:
SELECT keyword, count(*) as 'Number of Occurences'
FROM movie_keyword
JOIN
keyword
ON keyword.`id` = movie_keyword.`keyword_id`
WHERE movie_keyword.keyword_id IN (RESULTS_FROM_FIRST_QUERY_SEPARATED_BY_COMMAS)
GROUP BY keyword
ORDER BY count(*) DESC;
Replace RESULTS_FROM_FIRST_QUERY_SEPARATED_BY_COMMAS with the proper values programmatically, from whatever language you're using.
The query seems fine, but I think the structure is not; try adding an index on this column:
keyword.id
try,
CREATE INDEX keyword_ix ON keyword (id);
or
ALTER TABLE keyword ADD INDEX keyword_ix (id);
It would be much better if you could post the structures of your tables, keyword and movie_keyword. Which of the two is the main table, and which is the referencing one?
SELECT keyword, count(movie_keyword.id) as 'Number of Occurences'
FROM movie_keyword
INNER JOIN keyword
ON keyword.`id` = movie_keyword.`keyword_id`
GROUP BY keyword
ORDER BY `Number of Occurences` DESC
LIMIT 5
I know this is a pretty old question, but because I think xception forgot about derived tables in MySQL, I want to suggest another solution. It requires only one query and it avoids joining the big table. If someone has such big data and can test it (maybe the question creator), please share the results.
SELECT keyword.keyword, _temp.occurences
FROM (
SELECT keyword_id, COUNT( keyword_id ) AS occurences
FROM movie_keyword
GROUP BY keyword_id
ORDER BY occurences DESC
LIMIT 5
) AS _temp
JOIN keyword ON _temp.keyword_id = keyword.id
ORDER BY _temp.occurences DESC