I have a table to store data from csv files. It is a large table (over 40 million rows). This is its structure:
CREATE TABLE `imported_lines` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`day` date NOT NULL,
`name` varchar(256) NOT NULL,
`origin_id` int(11) NOT NULL,
`time` time(3) NOT NULL,
`main_index` tinyint(4) NOT NULL DEFAULT 0,
`transaction_index` tinyint(4) NOT NULL DEFAULT 0,
`data` varchar(4096) NOT NULL,
`error` bit(1) NOT NULL,
`expressions_applied` bit(1) NOT NULL,
`count_records` smallint(6) NOT NULL DEFAULT 0,
`client_id` tinyint(4) NOT NULL DEFAULT 0,
`receive_date` datetime(3) NOT NULL,
PRIMARY KEY (`id`,`client_id`),
UNIQUE KEY `uq` (`client_id`,`name`,`origin_id`,`receive_date`),
KEY `dh` (`day`,`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
/*!50100 PARTITION BY HASH (`client_id`) PARTITIONS 15 */
When I perform a SELECT with a one-day filter, it returns data very quickly (0.4 s). But as I widen the date range it becomes slower, until it eventually fails with a timeout error.
This is the query:
SELECT origin_id, error, main_index, transaction_index,
expressions_applied, name, day,
COUNT(id) AS total, SUM(count_records) AS sum_records
FROM imported_lines FORCE INDEX (dh)
WHERE client_id = 1
AND day >= '2017-07-02' AND day <= '2017-07-03'
AND name IN ('name1', 'name2', 'name3', ...)
GROUP BY origin_id, error, main_index, transaction_index, expressions_applied, name, day;
I think the IN clause may be hurting performance. I also tried adding the uq index to the hint (FORCE INDEX (dh, uq)), which gave a small gain.
I also tried an INNER JOIN (SELECT name FROM providers WHERE id = 2) prov ON prov.name = il.name, but that didn't make the query any quicker either.
EDIT
EXPLAINing the query
id - 1
select_type - SIMPLE
table - imported_lines
type - range
possible_keys - uq, dh
key - dh
key_len - 261
ref - NULL
rows - 297988
extra - Using where; Using temporary; Using filesort
Any suggestions on what I should do?
I made a few changes: I added a new multi-column index (as suggested by @Uueerdo) and rewrote the query as another user suggested (he has since deleted his answer).
I ran EXPLAIN PARTITIONS on a few queries and tested with SQL_NO_CACHE to make sure the query cache was not involved. Searching a whole month of data now takes 1.8 s.
It's so much faster!
This is what I did:
ALTER TABLE `imported_lines` DROP INDEX dh;
ALTER TABLE `imported_lines` ADD INDEX dhc (`day`, `name`, `client_id`);
Query:
SELECT origin_id, error, main_index, transaction_index,
expressions_applied, name, day,
COUNT(id) AS total, SUM(count_records) AS sum_records
FROM imported_lines il
INNER JOIN (
SELECT id FROM imported_lines
WHERE client_id = 1
AND day >= '2017-07-01' AND day <= '2017-07-31'
AND name IN ('name1', 'name2', 'name3', ...)
) AS il_filter
ON il_filter.id = il.id
WHERE il.client_id = 1
GROUP BY origin_id, error, main_index, transaction_index, expressions_applied, name, day;
With the INNER JOIN, EXPLAIN PARTITIONS shows that the query now uses the index. The WHERE il.client_id = 1 condition also reduces the number of partitions that have to be searched.
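As a sanity check on the rewrite, here is a minimal sketch in Python using the standard-library sqlite3 module as a stand-in for MySQL. That is an assumption made only so the example is runnable: SQLite has no partitions and its planner differs, so this says nothing about speed. It only checks that the derived-table form returns the same groups as the direct form. The table is trimmed to the relevant columns and the rows are invented.

```python
import sqlite3

# SQLite stands in for MySQL here (assumption); the rows are invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE imported_lines (
    id            INTEGER PRIMARY KEY,
    day           TEXT    NOT NULL,
    name          TEXT    NOT NULL,
    origin_id     INTEGER NOT NULL,
    count_records INTEGER NOT NULL DEFAULT 0,
    client_id     INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX dhc ON imported_lines (day, name, client_id);
""")
conn.executemany(
    "INSERT INTO imported_lines (day, name, origin_id, count_records, client_id) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("2017-07-01", "name1", 10, 5, 1),
        ("2017-07-01", "name1", 10, 7, 1),
        ("2017-07-15", "name2", 11, 3, 1),
        ("2017-07-15", "name2", 11, 4, 2),  # different client
        ("2017-08-01", "name1", 10, 9, 1),  # outside the date range
        ("2017-07-20", "other", 12, 1, 1),  # name not in the IN list
    ],
)

# The same filter is used by both query shapes.
FILTER = """
    client_id = 1
    AND day >= '2017-07-01' AND day <= '2017-07-31'
    AND name IN ('name1', 'name2')
"""

direct_sql = f"""
SELECT origin_id, name, day, COUNT(id) AS total, SUM(count_records) AS sum_records
FROM imported_lines
WHERE {FILTER}
GROUP BY origin_id, name, day
ORDER BY origin_id, name, day
"""

derived_sql = f"""
SELECT il.origin_id, il.name, il.day,
       COUNT(il.id) AS total, SUM(il.count_records) AS sum_records
FROM imported_lines il
INNER JOIN (SELECT id FROM imported_lines WHERE {FILTER}) AS il_filter
        ON il_filter.id = il.id
WHERE il.client_id = 1
GROUP BY il.origin_id, il.name, il.day
ORDER BY il.origin_id, il.name, il.day
"""

direct_rows = conn.execute(direct_sql).fetchall()
derived_rows = conn.execute(derived_sql).fetchall()
print(direct_rows)
print(derived_rows)
```

Both queries should print the same two groups; on the real table, the difference is in how many rows have to be fetched before grouping.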
Thanks for your help!
I would like to include a column in my table with the simple moving average of stock data. I have been able to create several queries which successfully do so, however the query speed is slow. My goal is to improve the query speed.
I have the following table:
CREATE TABLE `timeseries_test` (
`timeseries_id` int(11) NOT NULL AUTO_INCREMENT,
`stock_id` int(10) NOT NULL,
`date` date NOT NULL,
`open` decimal(16,8) NOT NULL,
`high` decimal(16,8) NOT NULL,
`low` decimal(16,8) NOT NULL,
`close` decimal(16,8) NOT NULL,
`adjusted_close` double(16,8) NOT NULL,
`volume` int(16) NOT NULL,
`dividend` double(16,8) NOT NULL,
`split_coefficient` double(16,15) NOT NULL,
`100sma` decimal(16,8) NOT NULL,
PRIMARY KEY (`timeseries_id`),
KEY `stock` (`stock_id`),
KEY `date` (`date`),
KEY `date_stock` (`stock_id`,`date`)
) ENGINE=InnoDB AUTO_INCREMENT=5444325 DEFAULT CHARSET=latin1
I have tried many different query formats, but they all take about 25 seconds per 5000 rows. The SELECT on its own takes less than a second. Below is an example query:
UPDATE stock.timeseries_test t1 INNER JOIN (
SELECT a.timeseries_id,
Round( ( SELECT SUM(b.close) / COUNT(b.close)
FROM timeseries_test AS b
WHERE DATEDIFF(a.date, b.date) BETWEEN 0 AND 99 AND a.stock_id = b.stock_id
), 2 ) AS '100sma'
FROM timeseries_test AS a) t2
ON t1.`timeseries_id` = t2.`timeseries_id`
SET t1.100sma = t2.100SMA
WHERE t2.100sma = null
Below is the EXPLAIN output:
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> NULL ALL NULL NULL NULL NULL 10385 10.00 Using where
1 UPDATE t1 NULL eq_ref PRIMARY PRIMARY 4 t2.timeseries_id 1 100.00 NULL
2 DERIVED a NULL index NULL date_stock 7 NULL 10385 100.00 Using index
3 DEPENDENT SUBQUERY b NULL ref stock,date_stock stock 4 stock.a.stock_id 5192 100.00 Using where
Any help is appreciated.
If you are running MySQL 8.0, I recommend window functions with a range specification; this avoids the need for a correlated subquery.
update stock.timeseries_test t1
inner join (
select timeseries_id,
avg(close) over(
partition by stock_id
order by date
range between interval 99 day preceding and current row
) `100sma`
from timeseries_test
) t2 on t1.timeseries_id = t2.timeseries_id
set t1.`100sma` = t2.`100sma`
It is quite unclear what the purpose of the original, outer where clause is, so I removed it:
WHERE t2.`100sma` = null
If you do want to check for nullness, then you need IS NULL; but doing so on t2 would pretty much defeat the whole logic of the update statement. Maybe you meant:
WHERE t1.`100sma` is null
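To illustrate why the original `= null` filter matches nothing, here is a small sketch in Python with the standard-library sqlite3 module (used as a stand-in for MySQL; both follow the SQL rule that a comparison with NULL is never true). The table `t` and its rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, sma REAL)")
conn.executemany(
    "INSERT INTO t (id, sma) VALUES (?, ?)",
    [(1, None), (2, 10.5), (3, None)],
)

# "= NULL" evaluates to NULL (unknown) for every row, so no row passes the filter.
eq_null = conn.execute("SELECT id FROM t WHERE sma = NULL ORDER BY id").fetchall()

# "IS NULL" is the test that actually finds the missing values.
is_null = conn.execute("SELECT id FROM t WHERE sma IS NULL ORDER BY id").fetchall()

print(eq_null)
print(is_null)
```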
A column wrapped in a function call is not sargable, so no index on date can be used for this test. Instead of
DATEDIFF(a.date, b.date) BETWEEN 0 AND 99
use
a.date BETWEEN b.date AND b.date + INTERVAL 99 DAY
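(Here the bare column is a.date. If the lookup should use the index on b, as in your correlated subquery, swap a and b: b.date BETWEEN a.date - INTERVAL 99 DAY AND a.date.)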
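The effect of sargability can be seen in a query plan. The sketch below uses Python's standard-library sqlite3 module as a stand-in for MySQL (an assumption: the plan text is SQLite's, not MySQL's EXPLAIN, and julianday()/date() replace DATEDIFF and INTERVAL). The table `ts`, the fixed anchor date, and the rows are invented. Both predicates select the same rows, but only the bare-column form lets the date column take part in the index lookup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ts (
    id       INTEGER PRIMARY KEY,
    stock_id INTEGER NOT NULL,
    date     TEXT    NOT NULL,
    close    REAL    NOT NULL
);
CREATE INDEX date_stock ON ts (stock_id, date);
""")
conn.executemany(
    "INSERT INTO ts (stock_id, date, close) VALUES (?, ?, ?)",
    [
        (1, "2019-12-31", 100.0),   # 100 days before the anchor: outside the window
        (1, "2020-01-01", 10.0),    # 99 days before the anchor: inside
        (1, "2020-04-09", 20.0),    # the anchor day itself: inside
        (1, "2020-04-10", 1000.0),  # after the anchor: outside
    ],
)

# The column is hidden inside a function, so only stock_id can use the index.
wrapped_sql = """
SELECT AVG(close) FROM ts
WHERE stock_id = 1
  AND julianday('2020-04-09') - julianday(date) BETWEEN 0 AND 99
"""

# The column stands alone on one side, so the index can also bound the date range.
sargable_sql = """
SELECT AVG(close) FROM ts
WHERE stock_id = 1
  AND date BETWEEN date('2020-04-09', '-99 days') AND '2020-04-09'
"""

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail text.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

wrapped_avg = conn.execute(wrapped_sql).fetchone()[0]
sargable_avg = conn.execute(sargable_sql).fetchone()[0]
wrapped_plan = plan(wrapped_sql)
sargable_plan = plan(sargable_sql)

print(wrapped_avg, wrapped_plan)
print(sargable_avg, sargable_plan)
```

In the printed plans, the sargable version lists a date range among the index constraints, while the wrapped version can only seek on stock_id and must test every row of that stock.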
I suspect (from the column names) that the pair (stock_id,date) is unique and that timeseries_id is never really used. If those are correct, then
PRIMARY KEY (`timeseries_id`),
KEY `date_stock` (`stock_id`,`date`)
-->
PRIMARY KEY(`stock_id`,`date`)
The ON clause (currently on timeseries_id) would need to be changed to test both of those columns.
Also, toss this since there is another index that starts with the same column(s):
KEY `stock` (`stock_id`),
In the software that I develop, a car dealer application, there is an agenda section with all of the users' appointments.
This section loads quickly under normal daily use, with thousands of rows, but becomes really slow when the agenda tables reach 1 million rows.
The structure:
1) Main table
CREATE TABLE IF NOT EXISTS `agenda` (
`id_agenda` int(11) NOT NULL AUTO_INCREMENT,
`id_user` int(11) NOT NULL DEFAULT '0',
`id_agency` int(11) NOT NULL DEFAULT '0',
`id_customer` int(11) DEFAULT NULL,
`id_car` int(11) DEFAULT NULL,
`id_owner` int(11) DEFAULT NULL,
`type` int(11) NOT NULL DEFAULT '8',
`title` varchar(255) NOT NULL DEFAULT '',
`text` text NOT NULL,
`start_day` date NOT NULL DEFAULT '0000-00-00',
`end_day` date NOT NULL DEFAULT '0000-00-00',
`start_hour` time NOT NULL DEFAULT '00:00:00',
`end_hour` time NOT NULL DEFAULT '00:00:00',
PRIMARY KEY (`id_agenda`),
KEY `start_day` (`start_day`),
KEY `id_customer` (`id_customer`),
KEY `id_car` (`id_car`),
KEY `id_user` (`id_user`),
KEY `id_owner` (`id_owner`),
KEY `type` (`type`),
KEY `id_agency` (`id_agency`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 ;
2) Secondary table
CREATE TABLE IF NOT EXISTS `agenda_cars` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`id_agenda` int(11) NOT NULL,
`id_car` int(11) NOT NULL,
`id_owner` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `id_agenda` (`id_agenda`),
KEY `id_car` (`id_car`),
KEY `id_owner` (`id_owner`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
Query:
SELECT a.id_agenda
FROM agenda as a
LEFT JOIN agenda_cars as agc on agc.id_agenda = a.id_agenda
WHERE
(a.id_customer = '22' OR (a.id_owner = '22' OR agc.id_owner = '22' ))
GROUP BY a.id_agenda
ORDER BY a.start_day, a.start_hour
Explain:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE a index PRIMARY PRIMARY 4 NULL 1051987 Using temporary; Using filesort
1 SIMPLE agc ref id_agenda id_agenda 4 db.a.id_agenda 1 Using where
The query takes 10 seconds to finish with id 22, and can take up to 20 seconds with other ids. That is just the query; loading everything in the web page of course takes longer.
I don't understand why it takes so long to get the data. I think the indexes are configured correctly and the query is pretty simple, so why?
Is it too much data?
I've solved it this way:
SELECT a.id_agenda
FROM
(
SELECT id_agenda
FROM agenda
WHERE (id_customer = '22' OR id_owner = '22' )
UNION
SELECT id_agenda
FROM agenda_cars
WHERE id_owner = '22'
) as at
INNER JOIN agenda as a on a.id_agenda = at.id_agenda
GROUP BY a.id_agenda
ORDER BY a.start_day, a.start_hour
This version of the query is ten times faster than the previous one... but why?
Thanks to everyone who wants to help clear up my doubts!
UPDATE AFTER Rick James solution:
Query suggested
SELECT a.id_agenda
FROM
(
SELECT id_agenda FROM agenda WHERE id_customer = '22'
UNION DISTINCT
SELECT id_agenda FROM agenda WHERE id_owner = '22'
UNION DISTINCT
SELECT id_agenda FROM agenda_cars WHERE id_owner = '22'
) as at
INNER JOIN agenda as a ON a.id_agenda = at.id_agenda
ORDER BY a.start_datetime;
Result: 279 total, 0.0111 sec
EXPLAIN:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 366 Using temporary; Using filesort
1 PRIMARY a eq_ref PRIMARY PRIMARY 4 at.id_agenda 1 NULL
2 DERIVED agenda ref id_customer id_customer 5 const 1 Using index
3 UNION agenda ref id_owner id_owner 5 const 114 Using index
4 UNION agenda_cars ref id_owner id_owner 4 const 250 NULL
NULL UNION RESULT <union2,3,4> ALL NULL NULL NULL NULL NULL Using temporary
Before I dig into what can be done, let me list several red flags I see.
OR is hard to optimize
Filtering (WHERE) on multiple tables JOINed together is hard to optimize.
GROUP BY x ORDER BY z means two passes over the data, usually 2 temp tables and filesorts.
Did you really mean LEFT? It says "the right table (agc) might be missing, in which case provide NULLs".
(You may not be able to get rid of all of the red flags.)
Red flags in the Schema:
Indexing every column -- usually not useful
Only single-column indexes -- "composite" indexes often help.
DATE and TIME as separate columns -- usually makes for clumsy queries.
OK, those are off my chest; now to study the query... (Oh, and thanks for providing the CREATEs and EXPLAIN!)
The ON implies a 1:many relationship between agenda:agenda_cars. Is that correct?
id_owner and id_car are in both tables, yet are not included in the ON; what's up?
(Here's the meat of the answer to your final question.) Why have GROUP BY? I see no aggregates. I will guess that the 1:many relationship led to multiple rows, and you needed to de-dup? For de-duplication, please use DISTINCT. But the real solution is to avoid the "inflate (JOIN) - deflate (GROUP BY)" syndrome. Your subquery is a good start on that.
Rolling some of the above comments in, plus more:
SELECT a.id_agenda
FROM
(
SELECT id_agenda FROM agenda WHERE id_customer = '22'
UNION DISTINCT
SELECT id_agenda FROM agenda WHERE id_owner = '22'
UNION DISTINCT
SELECT id_agenda FROM agenda_cars WHERE id_owner = '22'
) as at
INNER JOIN agenda as a ON a.id_agenda = at.id_agenda
ORDER BY a.start_datetime;
Notes:
Got rid of the other OR
Explicit UNION DISTINCT to be clear that dups are expected.
Toss GROUP BY and not using SELECT DISTINCT; UNION DISTINCT deals with the need.
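You have the 4 necessary indexes (one per subquery): (id_customer), (id_owner) (on both tables) and PRIMARY KEY(id_agenda).
The indexes are "covering" for the two agenda subqueries (an extra bonus), because an InnoDB secondary index implicitly includes the primary key. The agenda_cars subquery is not covered, since that table's primary key is id rather than id_agenda; INDEX(id_owner, id_agenda) would cover it.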
There will be one unavoidable tmp table and file sort -- for the ORDER BY, but it won't be on a million rows.
(No need for composite indexes -- this time.)
I changed to a DATETIME; change back if you have a good reason for splitting them.
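The "inflate - deflate" point and the UNION rewrite can be checked on a toy data set. The sketch below uses Python's standard-library sqlite3 module as a stand-in for MySQL (an assumption; SQLite has no UNION DISTINCT keyword, so plain UNION, which has the same meaning, is used, and the derived table is named `hits` instead of `at`). The rows are invented, and the tables are trimmed to the columns the query touches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE agenda (
    id_agenda   INTEGER PRIMARY KEY,
    id_customer INTEGER,
    id_owner    INTEGER,
    start_day   TEXT NOT NULL,
    start_hour  TEXT NOT NULL
);
CREATE TABLE agenda_cars (
    id        INTEGER PRIMARY KEY,
    id_agenda INTEGER NOT NULL,
    id_owner  INTEGER NOT NULL
);
""")
conn.executemany("INSERT INTO agenda VALUES (?, ?, ?, ?, ?)", [
    (1, 22, None, "2020-01-03", "09:00:00"),  # matches via id_customer
    (2, 5, 22, "2020-01-01", "09:00:00"),     # matches via id_owner
    (3, 5, 6, "2020-01-02", "09:00:00"),      # matches only via agenda_cars
    (4, 7, 8, "2020-01-05", "09:00:00"),      # no match anywhere
    (5, 22, 22, "2020-01-04", "09:00:00"),    # matches via both agenda columns
])
conn.executemany("INSERT INTO agenda_cars (id_agenda, id_owner) VALUES (?, ?)", [
    (3, 22), (3, 22),  # two cars on one appointment: this is what inflates the join
    (4, 9),
    (1, 22),
])

JOIN_FILTER = """
FROM agenda AS a
LEFT JOIN agenda_cars AS agc ON agc.id_agenda = a.id_agenda
WHERE a.id_customer = 22 OR a.id_owner = 22 OR agc.id_owner = 22
"""

# Rows produced by the join before GROUP BY collapses them again.
inflated_count = conn.execute("SELECT COUNT(*) " + JOIN_FILTER).fetchone()[0]

join_ids = [r[0] for r in conn.execute(
    "SELECT a.id_agenda " + JOIN_FILTER +
    " GROUP BY a.id_agenda ORDER BY a.start_day, a.start_hour")]

union_ids = [r[0] for r in conn.execute("""
    SELECT a.id_agenda
    FROM (
        SELECT id_agenda FROM agenda WHERE id_customer = 22
        UNION
        SELECT id_agenda FROM agenda WHERE id_owner = 22
        UNION
        SELECT id_agenda FROM agenda_cars WHERE id_owner = 22
    ) AS hits
    INNER JOIN agenda AS a ON a.id_agenda = hits.id_agenda
    ORDER BY a.start_day, a.start_hour
""")]

print(inflated_count, join_ids, union_ids)
```

The join form builds more rows than it finally returns, while the UNION form collects only matching ids and de-duplicates them before touching agenda again.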
Did I get you another 10x? Did I explain it sufficiently?
Oh, one more thing...
This query returns a list of ids ordered by something that it does not return (date+time). What will you do with the ids? If you are using this as a subquery in another query, then the Optimizer has a right to throw away the ORDER BY. Just warning you.
How can I make my response time faster? The average response time is approximately 0.2 s (8039 records in my items table and 81 records in my tracking table).
Query
SELECT a.name, b.cnt FROM `items` a LEFT JOIN
(SELECT guid, COUNT(*) cnt FROM tracking WHERE
date > UNIX_TIMESTAMP(NOW() - INTERVAL 1 day ) GROUP BY guid) b ON
a.`id` = b.guid WHERE a.`type` = 'streaming' AND a.`state` = 1
ORDER BY b.cnt DESC LIMIT 15 OFFSET 75
Tracking table structure
CREATE TABLE `tracking` (
`id` bigint(11) NOT NULL AUTO_INCREMENT,
`guid` int(11) DEFAULT NULL,
`ip` int(11) NOT NULL,
`date` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `i1` (`ip`,`guid`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=4303 DEFAULT CHARSET=latin1;
Items table structure
CREATE TABLE `items` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`guid` int(11) DEFAULT NULL,
`type` varchar(255) DEFAULT NULL,
`name` varchar(255) DEFAULT NULL,
`embed` varchar(255) DEFAULT NULL,
`url` varchar(255) DEFAULT NULL,
`description` text,
`tags` varchar(255) DEFAULT NULL,
`date` int(11) DEFAULT NULL,
`vote_val_total` float DEFAULT '0',
`vote_total` float(11,0) DEFAULT '0',
`rate` float DEFAULT '0',
`icon` text CHARACTER SET ascii,
`state` int(11) DEFAULT '0',
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=9258 DEFAULT CHARSET=latin1;
Your query, as written, doesn't make much sense. It produces all possible combinations of rows in your two tables and then groups them.
You may want this:
SELECT a.*, b.cnt
FROM `items` a
LEFT JOIN (
SELECT guid, COUNT(*) cnt
FROM tracking
WHERE `date` > UNIX_TIMESTAMP(NOW() - INTERVAL 1 day)
GROUP BY guid
) b ON a.guid = b.guid
ORDER BY b.cnt DESC
The high-volume data in this query come from the relatively large tracking table. So, you should add a compound index to it, using the columns (date, guid). This will allow your query to random-access the index by date and then scan it for guid values.
ALTER TABLE tracking ADD INDEX guid_summary (`date`, guid);
I suppose you'll see a nice performance improvement.
Pro tip: Don't use SELECT *. Instead, give a list of the columns you want in your result set. For example,
SELECT a.guid, a.name, a.description, b.cnt
Why is this important?
First, it makes your software more resilient against somebody adding columns to your tables in the future.
Second, it tells the MySQL server to sling around only the information you want. That can improve performance really dramatically, especially when your tables get big.
Since tracking has significantly fewer rows than items, I will propose the following.
SELECT i.name, c.cnt
FROM
(
SELECT guid, COUNT(*) cnt
FROM tracking
WHERE date > UNIX_TIMESTAMP(NOW() - INTERVAL 1 day )
GROUP BY guid
) AS c
JOIN items AS i ON i.id = c.guid
WHERE i.type = 'streaming'
AND i.state = 1
ORDER BY c.cnt DESC
LIMIT 15 OFFSET 75
It will fail to display any items for which cnt is 0. (Your version displays the items with NULL for the count.)
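The difference between the two join directions can be seen in a small sketch. It uses Python's standard-library sqlite3 module as a stand-in for MySQL (an assumption), a fixed hypothetical cutoff of 1000 in place of UNIX_TIMESTAMP(NOW() - INTERVAL 1 day), trimmed tables, and invented rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, type TEXT, state INTEGER);
CREATE TABLE tracking (id INTEGER PRIMARY KEY, guid INTEGER, date INTEGER);
CREATE INDEX guid_summary ON tracking (date, guid);
""")
conn.executemany("INSERT INTO items (id, name, type, state) VALUES (?, ?, ?, ?)", [
    (1, "alpha", "streaming", 1),
    (2, "beta", "streaming", 1),
    (3, "gamma", "streaming", 1),
    (4, "delta", "download", 1),
])
conn.executemany("INSERT INTO tracking (guid, date) VALUES (?, ?)", [
    (1, 1500), (1, 1600), (3, 1700),
    (2, 500),   # older than the cutoff, so item 2 has no recent hits
    (4, 1800),  # recent, but item 4 is not a streaming item
])
CUTOFF = 1000  # hypothetical stand-in for "one day ago" as a Unix timestamp

# Pre-aggregated hit counts per guid, shared by both query shapes.
COUNTS = "SELECT guid, COUNT(*) AS cnt FROM tracking WHERE date > ? GROUP BY guid"

# INNER JOIN starting from the counts: items without recent hits disappear.
inner_rows = conn.execute(f"""
    SELECT i.name, c.cnt
    FROM ({COUNTS}) AS c
    JOIN items AS i ON i.id = c.guid
    WHERE i.type = 'streaming' AND i.state = 1
    ORDER BY c.cnt DESC, i.name
""", (CUTOFF,)).fetchall()

# LEFT JOIN starting from items: every matching item is kept, with 0 for no hits.
left_rows = conn.execute(f"""
    SELECT i.name, COALESCE(c.cnt, 0) AS cnt
    FROM items AS i
    LEFT JOIN ({COUNTS}) AS c ON c.guid = i.id
    WHERE i.type = 'streaming' AND i.state = 1
    ORDER BY cnt DESC, i.name
""", (CUTOFF,)).fetchall()

print(inner_rows)
print(left_rows)
```

Which shape is right depends on whether items with no recent tracking rows should appear in the listing at all.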
Composite indexes needed:
items: The PRIMARY KEY(id) is sufficient.
tracking: INDEX(date, guid) -- "covering"
Other issues:
If ip is an IP-address, it needs to be INT UNSIGNED. But that covers only IPv4, not IPv6.
It seems like date is not just a "date", but really a date+time. Please rename it to avoid confusion.
float(11,0) -- Don't use FLOAT for integers. Don't use (m,n) on FLOAT or DOUBLE. INT UNSIGNED makes more sense here.
OFFSET is naughty when it comes to performance -- it must scan over the skipped records. But, in your query, there is no way to avoid collecting all the possible rows, sorting them, stepping over 75, and only finally delivering 15 rows. (And, with no more than 81, it won't be a full 15.)
What version are you using? There have been important changes to the Optimization of LEFT JOIN ( SELECT ... ). Please provide EXPLAIN SELECT for each query under discussion.
My app needs to run this query pretty often; it gets a list of user data for the app to display. The problem is that the subquery on user_quiz is resource heavy, and calculating the rankings is very CPU intensive too.
Benchmark: ~.5 second each run
When it will be run:
When the user wants to see their ranking
When the user wants to see other people's rankings
When getting a list of the user's friends
0.5 seconds is a really long time considering this query will be run pretty often. Is there anything I could do to optimize this query?
Table for user:
CREATE TABLE `user` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`firstname` varchar(100) DEFAULT NULL,
`lastname` varchar(100) DEFAULT NULL,
`password` varchar(20) NOT NULL,
`email` varchar(300) NOT NULL,
`verified` tinyint(10) DEFAULT NULL,
`avatar` varchar(300) DEFAULT NULL,
`points_total` int(11) unsigned NOT NULL DEFAULT '0',
`points_today` int(11) unsigned NOT NULL DEFAULT '0',
`number_correctanswer` int(11) unsigned NOT NULL DEFAULT '0',
`number_watchedvideo` int(11) unsigned NOT NULL DEFAULT '0',
`create_time` datetime NOT NULL,
`type` tinyint(1) unsigned NOT NULL DEFAULT '1',
`number_win` int(11) unsigned NOT NULL DEFAULT '0',
`number_lost` int(11) unsigned NOT NULL DEFAULT '0',
`number_tie` int(11) unsigned NOT NULL DEFAULT '0',
`level` int(1) unsigned NOT NULL DEFAULT '0',
`islogined` tinyint(1) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=230 DEFAULT CHARSET=utf8;
Table for user_quiz:
CREATE TABLE `user_quiz` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`user_id` int(11) NOT NULL,
`question_id` int(11) NOT NULL,
`is_answercorrect` int(11) unsigned NOT NULL DEFAULT '0',
`question_answer_datetime` datetime NOT NULL,
`score` int(1) DEFAULT NULL,
`quarter` int(1) DEFAULT NULL,
`game_type` int(1) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `user_id` (`user_id`)
) ENGINE=InnoDB AUTO_INCREMENT=9816 DEFAULT CHARSET=utf8;
Table for user_starter:
CREATE TABLE `user_starter` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`user_id` int(11) DEFAULT NULL,
`result` int(1) DEFAULT NULL,
`created_date` date DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `user_id` (`user_id`)
) ENGINE=InnoDB AUTO_INCREMENT=456 DEFAULT CHARSET=utf8mb4;
My indexes:
Table: user
Table Non_unique Key_name Seq_in_index Column_name Collation Cardinality Sub_part Packed Null Index_type Comment Index_comment
user 0 PRIMARY 1 id A 32 BTREE
Table: user_quiz
Table Non_unique Key_name Seq_in_index Column_name Collation Cardinality Sub_part Packed Null Index_type Comment Index_comment
user_quiz 0 PRIMARY 1 id A 9462 BTREE
user_quiz 1 user_id 1 user_id A 270 BTREE
Table: user_starter
Table Non_unique Key_name Seq_in_index Column_name Collation Cardinality Sub_part Packed Null Index_type Comment Index_comment
user_starter 0 PRIMARY 1 id A 454 BTREE
user_starter 1 user_id 1 user_id A 227 YES BTREE
Query:
SET @curRank = 0;
SET @lastPlayerPoints = 0;
SELECT
sub.*,
@curRank := IF(@lastPlayerPoints!=points_week, @curRank + 1, @curRank) AS rank,
@lastPlayerPoints := points_week AS db_PPW
FROM (
SELECT u.id,u.firstname,u.lastname,u.email,u.avatar,u.type,u.points_total,u.number_win,u.number_lost,u.number_tie,u.verified,
COALESCE(SUM(uq.score),0) as points_week,
COALESCE(us.number_lost,0) as number_week_lost,
COALESCE(us.number_win,0) as number_week_win,
(select MAX(question_answer_datetime) from user_quiz WHERE user_id = u.id and game_type = 1) as lastFrdFight,
(select MAX(question_answer_datetime) from user_quiz WHERE user_id = u.id and game_type = 2) as lastBotFight
FROM `user` u
LEFT JOIN (SELECT user_id,
count(case when result=1 then 1 else null end) as number_win,
count(case when result=-1 then 1 else null end) as number_lost
from user_starter where created_date BETWEEN '2016-01-11 00:00:00' AND '2016-05-12 05:10:27' ) us ON u.id = us.user_id
LEFT JOIN (SELECT * FROM user_quiz WHERE question_answer_datetime BETWEEN '2016-01-11 00:00:00' AND '2016-05-12 00:00:00') uq on u.id = uq.user_id
GROUP BY u.id ORDER BY points_week DESC, u.lastname ASC, u.firstname ASC
) as sub
EXPLAIN:
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> ALL 3027 100
2 DERIVED u ALL PRIMARY 32 100 Using temporary; Using filesort
2 DERIVED <derived5> ALL 1 100 Using where; Using join buffer (Block Nested Loop)
2 DERIVED <derived6> ref <auto_key0> <auto_key0> 4 fancard.u.id 94 100
6 DERIVED user_quiz ALL 9461 100 Using where
5 DERIVED user_starter ALL 454 100 Using where
4 DEPENDENT SUBQUERY user_quiz ref user_id user_id 4 func 35 100 Using where
3 DEPENDENT SUBQUERY user_quiz ref user_id user_id 4 func 35 100 Using where
Example output and expected output:
Bench mark: around .5 second
The following index should make the subquery to user_quiz ultra fast.
ALTER TABLE user_quiz
ADD INDEX (`user_id`,`game_type`,`question_answer_datetime`)
Please provide SHOW CREATE TABLE tablename statements for all tables, as that will help with additional optimizations.
Update #1
Alright, I've had some time to look things over, and fortunately there appears to be a lot of relatively low-hanging fruit in terms of optimization.
Here are all the indexes to add:
ALTER TABLE user_quiz
ADD INDEX `userGametypeAnswerDatetimes` (`user_id`,`game_type`,`question_answer_datetime`);
ALTER TABLE user_quiz
ADD INDEX `userAnswerScores` (`user_id`,`question_answer_datetime`,`score`);
ALTER TABLE user_starter
ADD INDEX `userResultDates` (`user_id`,`result`,`created_date`);
Note that the names (such as userGametypeAnswerDatetimes) are optional, and you can name them to whatever makes the most sense to you. But, in general, it's good to put specific names on your custom indexes (simply for organization purposes.)
Now, here is your query, rewritten so that it should work well with those new indexes:
SET @curRank = 0;
SET @lastPlayerPoints = 0;
SELECT
sub.*,
@curRank := IF(@lastPlayerPoints!=points_week, @curRank + 1, @curRank) AS rank,
@lastPlayerPoints := points_week AS db_PPW
FROM (
SELECT u.id,
u.firstname,
u.lastname,
u.email,
u.avatar,
u.type,
u.points_total,
u.number_win,
u.number_lost,
u.number_tie,
u.verified,
COALESCE(user_scores.score,0) as points_week,
COALESCE(user_losses.number_lost,0) as number_week_lost,
COALESCE(user_wins.number_win,0) as number_week_win,
(
select MAX(question_answer_datetime)
from user_quiz
WHERE user_id = u.id and game_type = 1
) as lastFrdFight,
(
select MAX(question_answer_datetime)
from user_quiz
WHERE user_id = u.id
and game_type = 2
) as lastBotFight
FROM `user` u
LEFT OUTER JOIN (
SELECT user_id,
COUNT(*) AS number_win
from user_starter
WHERE created_date BETWEEN '2016-01-11 00:00:00' AND '2016-05-12 05:10:27'
AND result = 1
GROUP BY user_id
) user_wins
ON user_wins.user_id = u.id
LEFT OUTER JOIN (
SELECT user_id,
COUNT(*) AS number_lost
from user_starter
WHERE created_date BETWEEN '2016-01-11 00:00:00' AND '2016-05-12 05:10:27'
AND result = -1
GROUP BY user_id
) user_losses
ON user_losses.user_id = u.id
LEFT OUTER JOIN (
SELECT user_id, SUM(score) AS score
FROM user_quiz
WHERE question_answer_datetime
BETWEEN '2016-01-11 00:00:00' AND '2016-05-12 00:00:00'
GROUP BY user_id
) user_scores
ON u.id = user_scores.user_id
ORDER BY points_week DESC, u.lastname ASC, u.firstname ASC
) as sub
Note: This is not necessarily the best result. It depends a LOT on your data set, as to whether this is necessarily the best, and sometimes you need to do a bit of trial and error.
A hint as to what you can use trial and error on is the structure of how we query lastFrdFight and lastBotFight versus how we query points_week, number_week_lost, and number_week_win. All of these could either be done in the SELECT list (like the first two are in my query) or by joining to a subquery result (like the last three are in my query).
Mix and match to see what works best. In general, I've found joining to a subquery to be fastest when you have a large number of rows in the outer query (in this case, the user table). This is because it only needs to compute the results once, and can then just match them up on a user-by-user basis. Other times, it can be better to have the query in the SELECT list: each execution runs MUCH faster, since there are more constants (the user_id is already known), but it has to run once for each row. So it's a trade-off, and that is why you sometimes need trial and error.
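The two interchangeable shapes can be compared side by side. This is a sketch only, using Python's standard-library sqlite3 module as a stand-in for MySQL (an assumption), with trimmed tables and invented rows; it checks that both shapes give the same weekly points and says nothing about which is faster on a given data set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user (id INTEGER PRIMARY KEY, lastname TEXT);
CREATE TABLE user_quiz (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    score INTEGER,
    question_answer_datetime TEXT NOT NULL
);
CREATE INDEX userAnswerScores
    ON user_quiz (user_id, question_answer_datetime, score);
""")
conn.executemany("INSERT INTO user (id, lastname) VALUES (?, ?)",
                 [(1, "Adams"), (2, "Baker"), (3, "Clark")])
conn.executemany(
    "INSERT INTO user_quiz (user_id, score, question_answer_datetime) VALUES (?, ?, ?)",
    [
        (1, 5, "2016-02-01 10:00:00"),
        (1, 3, "2016-03-01 10:00:00"),
        (2, 4, "2015-12-31 10:00:00"),  # before the window
        (2, 7, "2016-04-01 00:00:00"),
    ],
)
WINDOW = ("2016-01-11 00:00:00", "2016-05-12 00:00:00")

# Shape 1: aggregate once for all users, then join the result to user.
joined_rows = conn.execute("""
    SELECT u.id, COALESCE(s.points_week, 0)
    FROM user AS u
    LEFT JOIN (
        SELECT user_id, SUM(score) AS points_week
        FROM user_quiz
        WHERE question_answer_datetime BETWEEN ? AND ?
        GROUP BY user_id
    ) AS s ON s.user_id = u.id
    ORDER BY u.id
""", WINDOW).fetchall()

# Shape 2: a correlated subquery in the SELECT list, run once per user row.
correlated_rows = conn.execute("""
    SELECT u.id,
           COALESCE((SELECT SUM(q.score)
                     FROM user_quiz AS q
                     WHERE q.user_id = u.id
                       AND q.question_answer_datetime BETWEEN ? AND ?), 0)
    FROM user AS u
    ORDER BY u.id
""", WINDOW).fetchall()

print(joined_rows)
print(correlated_rows)
```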
Why do the indexes work?
So, you may be wondering why I made the indexes as I did. If you are familiar with phone books (in this age of smartphones, that's no longer a valid assumption I can make) then we can use that as an analogy:
If you had a composite index of phonebookIndex (lastname,firstname,email) on your user table (example here! you don't actually need to add that index!) you would have a result similar to what a phone book provides. (Using email instead of phone number.)
Each index is an internal copy of the data in the overall table. With this phonebookIndex there would internally be stored a list of all users with their lastname, then their first name, and then their email, and each of these would be ordered, just like a phone book.
Why is that useful? Consider when you know someone's first and last name. You can quickly flip to where their last name is, then quickly go through that list of everyone with their last name, finding the first name you want, so obtaining the email.
Indexes work in exactly the same way, in terms of how the database looks at them.
Consider the userGametypeAnswerDatetimes index I defined above, and how we query that index in the lastFrdFight SELECT subquery.
(
select MAX(question_answer_datetime)
from user_quiz
WHERE user_id = u.id and game_type = 1
) as lastFrdFight
Notice how we have both the user_id (from the outer query) and the game_type as constants. That is exactly like our example earlier, with having the first and last name, and wanting to look up an email/phone number. In this case, we are looking for the MAX of the 3rd value in the index. Still easy: All the values are ordered, so if this index was sitting in front of us, we could just flip to the specific user_id, then look at the section with all game_type=1 and then just pick the last value to find the maximum. Very very fast. Same for the database. It can find this value extremely fast, which is why you saw an 80%+ reduction in your overall query time.
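The "flip to the right section and take the last entry" idea can be sketched in plain Python. A sorted list of tuples plays the role of the index on (user_id, game_type, question_answer_datetime); this is only an analogy for how an ordered index is searched, not how InnoDB stores it, and the entries and the helper name `last_fight` are invented.

```python
from bisect import bisect_left

# Hypothetical index contents, kept sorted like a composite index would be:
# (user_id, game_type, question_answer_datetime)
index = sorted([
    (1, 1, "2016-01-05 10:00:00"),
    (1, 1, "2016-03-02 09:30:00"),
    (1, 2, "2016-02-01 12:00:00"),
    (2, 1, "2016-01-20 08:00:00"),
])

def last_fight(index, user_id, game_type):
    """Return the MAX datetime for (user_id, game_type), or None if there is none."""
    # Binary search for where the next section, (user_id, game_type + 1), would
    # begin. The entry just before that position is the last one in our section.
    pos = bisect_left(index, (user_id, game_type + 1))
    if pos == 0:
        return None
    entry = index[pos - 1]
    # Make sure that entry really belongs to the section we were looking for.
    if entry[:2] != (user_id, game_type):
        return None
    return entry[2]

print(last_fight(index, 1, 1))
print(last_fight(index, 1, 2))
print(last_fight(index, 2, 2))
```

No scan over the whole list is needed: one binary search lands on the answer, which is the reason the two constants plus MAX on the third column fit this index so well.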
So, that's how indexes work, and why I chose these indexes as I did.
Be aware, that the more indexes you have, the more you'll see slowdowns when doing inserts and updates. But, if you are reading a lot more from your tables than you are writing, this is usually a more than acceptable trade off.
So, give these changes a shot, and let me know how it performs. Please provide the new EXPLAIN plan if you want further optimization help. Also, this should give you quite a few tools for using trial and error to see what works and what doesn't. All my changes are fairly independent of each other, so you can swap them in and out with your original query pieces to see how each one works.
I am trying to query against a table, partitioned by month, that is approaching 20M rows. I need to group by DATE(transaction_utc) as well as country_id. If I turn off the GROUP BY and the aggregates, just over 40k rows are returned, which isn't too many. However, adding the GROUP BY makes the query substantially slower, unless the GROUP BY is on the transaction_utc column, in which case it gets FAST.
I've been trying to optimize the first query below by tweaking the query and/or the indexes, and got to the point below (about 2x as fast as initially). However, I'm still stuck with a 5s query for summarizing 45k rows, which seems way too much.
For reference, this box is a brand new 24 logical core, 64GB RAM, Mariadb-5.5.x server with way more INNODB buffer pool available than index space on the server, so shouldn't be any RAM or CPU pressures.
So, I'm looking for ideas on what is causing this slow down and suggestions on speeding it up. Any feedback would be greatly appreciated! :)
Ok, onto the details...
The following query (the one I actually need) takes approx 5 seconds (+/-), and returns less than 100 rows.
SELECT lss.`country_id` AS CountryId
, Date(lss.`transaction_utc`) AS TransactionDate
, c.`name` AS CountryName, lss.`country_id` AS CountryId
, COALESCE(SUM(lss.`sale_usd`),0) AS SaleUSD
, COALESCE(SUM(lss.`commission_usd`),0) AS CommissionUSD
FROM `sales` lss
JOIN `countries` c ON lss.`country_id` = c.`country_id`
WHERE ( lss.`transaction_utc` BETWEEN '2012-09-26' AND '2012-10-26' AND lss.`username` = 'someuser' ) GROUP BY lss.`country_id`, DATE(lss.`transaction_utc`)
EXPLAIN SELECT for the same query is as follows. Notice that it's not using the transaction_utc key. Shouldn't it be using my covering index instead?
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE lss ref idx_unique,transaction_utc,country_id idx_unique 50 const 1208802 Using where; Using temporary; Using filesort
1 SIMPLE c eq_ref PRIMARY PRIMARY 4 georiot.lss.country_id 1
Now on to a couple of other options that I've tried in order to determine what's going on...
The following query (changed group by) takes about 5 seconds (+/-), and returns only 3 rows:
SELECT lss.`country_id` AS CountryId
, DATE(lss.`transaction_utc`) AS TransactionDate
, c.`name` AS CountryName, lss.`country_id` AS CountryId
, COALESCE(SUM(lss.`sale_usd`),0) AS SaleUSD
, COALESCE(SUM(lss.`commission_usd`),0) AS CommissionUSD
FROM `sales` lss
JOIN `countries` c ON lss.`country_id` = c.`country_id`
WHERE ( lss.`transaction_utc` BETWEEN '2012-09-26' AND '2012-10-26' AND lss.`username` = 'someuser' ) GROUP BY lss.`country_id`
The following query (removed group by) takes 4-5 seconds (+/-) and returns 1 row:
SELECT lss.`country_id` AS CountryId
, DATE(lss.`transaction_utc`) AS TransactionDate
, c.`name` AS CountryName, lss.`country_id` AS CountryId
, COALESCE(SUM(lss.`sale_usd`),0) AS SaleUSD
, COALESCE(SUM(lss.`commission_usd`),0) AS CommissionUSD
FROM `sales` lss
JOIN `countries` c ON lss.`country_id` = c.`country_id`
WHERE ( lss.`transaction_utc` BETWEEN '2012-09-26' AND '2012-10-26' AND lss.`username` = 'someuser' )
The following query takes .00X seconds (+/-) and returns ~45k rows. This to me shows that at max we're only trying to group 45K rows into less than 100 groups (as in my initial query):
SELECT lss.`country_id` AS CountryId
, DATE(lss.`transaction_utc`) AS TransactionDate
, c.`name` AS CountryName, lss.`country_id` AS CountryId
, COALESCE(SUM(lss.`sale_usd`),0) AS SaleUSD
, COALESCE(SUM(lss.`commission_usd`),0) AS CommissionUSD
FROM `sales` lss
JOIN `countries` c ON lss.`country_id` = c.`country_id`
WHERE ( lss.`transaction_utc` BETWEEN '2012-09-26' AND '2012-10-26' AND lss.`username` = 'someuser' )
GROUP BY lss.`transaction_utc`
TABLE SCHEMA:
CREATE TABLE IF NOT EXISTS `sales` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`user_linkshare_account_id` int(11) unsigned NOT NULL,
`username` varchar(16) NOT NULL,
`country_id` int(4) unsigned NOT NULL,
`order` varchar(16) NOT NULL,
`raw_tracking_code` varchar(255) DEFAULT NULL,
`transaction_utc` datetime NOT NULL,
`processed_utc` datetime NOT NULL ,
`sku` varchar(16) NOT NULL,
`sale_original` decimal(10,4) NOT NULL,
`sale_usd` decimal(10,4) NOT NULL,
`quantity` int(11) NOT NULL,
`commission_original` decimal(10,4) NOT NULL,
`commission_usd` decimal(10,4) NOT NULL,
`original_currency` char(3) NOT NULL,
PRIMARY KEY (`id`,`transaction_utc`),
UNIQUE KEY `idx_unique` (`username`,`order`,`processed_utc`,`sku`,`transaction_utc`),
KEY `raw_tracking_code` (`raw_tracking_code`),
KEY `idx_usd_amounts` (`sale_usd`,`commission_usd`),
KEY `idx_countries` (`country_id`),
KEY `transaction_utc` (`transaction_utc`,`username`,`country_id`,`sale_usd`,`commission_usd`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
/*!50100 PARTITION BY RANGE ( TO_DAYS(`transaction_utc`))
(PARTITION pOLD VALUES LESS THAN (735112) ENGINE = InnoDB,
PARTITION p201209 VALUES LESS THAN (735142) ENGINE = InnoDB,
PARTITION p201210 VALUES LESS THAN (735173) ENGINE = InnoDB,
PARTITION p201211 VALUES LESS THAN (735203) ENGINE = InnoDB,
PARTITION p201212 VALUES LESS THAN (735234) ENGINE = InnoDB,
PARTITION pMAX VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */ AUTO_INCREMENT=19696320 ;
The offending part is probably the GROUP BY DATE(transaction_utc). You also claim to have a covering index for this query but I see none. Your 5-column index has all the columns used in the query but not in the best order (which is: WHERE - GROUP BY - SELECT).
So, the engine, finding no useful index, would have to evaluate this function for all the 20M rows. Actually, it finds an index that starts with username (the idx_unique) and it uses that, so it has to evaluate the function for (only) 1.2M rows. If you had a (transaction_utc) or a (username, transaction_utc) it would choose the most useful of the three.
Can you afford to change the table structure by splitting the column into date and time parts?
If you can, then an index on (username, country_id, transaction_date) or, swapping the order of the two columns used for grouping, on (username, transaction_date, country_id) would be quite efficient.
A covering index on (username, country_id, transaction_date, sale_usd, commission_usd) would be even better.
If you want to keep the current structure, try changing the order inside your 5-column index to:
(username, country_id, transaction_utc, sale_usd, commission_usd)
or to:
(username, transaction_utc, country_id, sale_usd, commission_usd)
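The effect of the column order can be seen in a query plan. The sketch below uses Python's standard-library sqlite3 module as a stand-in for MariaDB (an assumption: no partitioning, SQLite's planner and plan text rather than MariaDB's EXPLAIN, and date() in place of DATE()). The table is trimmed to the columns the query uses, the index name `idx_demo` is hypothetical, and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE sales (
    id              INTEGER PRIMARY KEY,
    username        TEXT NOT NULL,
    country_id      INTEGER NOT NULL,
    transaction_utc TEXT NOT NULL,
    sale_usd        REAL NOT NULL,
    commission_usd  REAL NOT NULL
)
""")
conn.executemany(
    "INSERT INTO sales (username, country_id, transaction_utc, sale_usd, commission_usd) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("someuser", 1, "2012-09-27 10:00:00", 10.0, 1.0),
        ("someuser", 1, "2012-09-27 18:00:00", 5.0, 0.5),
        ("someuser", 2, "2012-10-01 09:00:00", 7.0, 0.7),
        ("other",    1, "2012-09-27 11:00:00", 99.0, 9.9),  # different user
        ("someuser", 1, "2012-11-01 00:00:00", 3.0, 0.3),   # outside the range
    ],
)

QUERY = """
SELECT country_id, date(transaction_utc) AS transaction_date,
       SUM(sale_usd), SUM(commission_usd)
FROM sales
WHERE transaction_utc BETWEEN '2012-09-26' AND '2012-10-26'
  AND username = 'someuser'
GROUP BY country_id, date(transaction_utc)
ORDER BY country_id, transaction_date
"""

def plan_with_index(columns):
    # Rebuild the single 5-column index with the given column order,
    # then return the readable detail text of the resulting query plan.
    conn.execute("DROP INDEX IF EXISTS idx_demo")
    conn.execute(f"CREATE INDEX idx_demo ON sales ({columns})")
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY))

# Column order from the question: the range column comes first.
original_plan = plan_with_index(
    "transaction_utc, username, country_id, sale_usd, commission_usd")

# Suggested order: the equality column first, then the range column.
reordered_plan = plan_with_index(
    "username, transaction_utc, country_id, sale_usd, commission_usd")

rows = conn.execute(QUERY).fetchall()
print(original_plan)
print(reordered_plan)
print(rows)
```

With the range column first, only the date range narrows the index search and username is checked row by row; with username first, both conditions bound the search.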
Since you are using MariaDB, you can use the VIRTUAL columns feature, without changing the existing columns:
Add a virtual (persistent) column and the appropriate index:
ALTER TABLE sales
ADD COLUMN transaction_date DATE NOT NULL
AS (DATE(transaction_utc))
PERSISTENT,
ADD INDEX special_IDX
(username, country_id, transaction_date, sale_usd, commission_usd) ;