I have a table with ~ 9 million records
Structure
id int PK AI
pa_id int
cha_id smallint
cha_level tinyint
cha_points mediumint
cha_points_till smallint
cha_points_from mediumint
cha_points_date datetime
My query
select max(cha_points) as highest,cha_id,count(id) as entry_count,
sum(cha_points) as total_points
from playeraccounts_cha_masteries
group by cha_id
order by total_points desc
My indexes
playeraccounts_cha_masteries 0 PRIMARY 1 id A 9058483 NULL NULL BTREE
playeraccounts_cha_masteries 1 cha_id 1 cha_id A 9 NULL NULL BTREE
playeraccounts_cha_masteries 1 pa_id 1 pa_id A 156270 NULL NULL BTREE
playeraccounts_cha_masteries 1 cha_points 1 cha_points A 166100 NULL NULL BTREE
The index on pa_id has its use in a different query.
Explain
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 simple m null range PRIMARY,cha_id PRIMARY 4 NULL 9164555 100.00 Using where; Using temporary; Using filesort
Is there any way I can speed up the query further?
You have 3 options:
Speed up the existing query
Create a composite index on the cha_id and cha_points fields, change count(id) to count(*) or count(cha_id), and test again. You may have to experiment with the order of the fields in the index. Check with EXPLAIN that the covering index is used.
By changing count(id) to count(*) or count(cha_id) you eliminate the need to check the id column. Since you use that count to return the number of records within each cha_id group, it is safe to replace the reference to the id field with * or cha_id.
Creating a composite index on the cha_id and cha_points fields results in a covering index, meaning all fields required by the query are in a single index, so the query does not have to scan the entire table.
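As a minimal sketch (the index name is illustrative):
ALTER TABLE playeraccounts_cha_masteries
  ADD INDEX idx_cha_id_points (cha_id, cha_points);
select max(cha_points) as highest, cha_id, count(*) as entry_count,
sum(cha_points) as total_points
from playeraccounts_cha_masteries
group by cha_id
order by total_points desc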
Create a separate statistics table and update it with triggers
Create a separate statistics table for playeraccounts_cha_masteries. You can use triggers to update counts, maximums, and totals. The page would query the statistics table instead of the playeraccounts_cha_masteries table. This solution may slow inserts / updates / deletes down, since each data modification transaction has to be serialised, so that the statistics table is properly updated.
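A hedged sketch of that idea (the table, column, and trigger names are illustrative; matching AFTER UPDATE and AFTER DELETE triggers would be needed as well):
CREATE TABLE cha_stats (
  cha_id smallint PRIMARY KEY,
  entry_count int NOT NULL,
  total_points bigint NOT NULL,
  highest mediumint NOT NULL
);
DELIMITER //
CREATE TRIGGER cha_masteries_after_insert
AFTER INSERT ON playeraccounts_cha_masteries
FOR EACH ROW
BEGIN
  -- create the group on first sight, otherwise fold the new row in
  INSERT INTO cha_stats (cha_id, entry_count, total_points, highest)
  VALUES (NEW.cha_id, 1, NEW.cha_points, NEW.cha_points)
  ON DUPLICATE KEY UPDATE
    entry_count  = entry_count + 1,
    total_points = total_points + NEW.cha_points,
    highest      = GREATEST(highest, NEW.cha_points);
END//
DELIMITER ;
Note that a delete cannot decrement highest without rescanning its group, which is one reason the periodic approach below can be the simpler option.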
Create a separate statistics table and update it periodically
Create a separate statistics table, but instead of using triggers to keep it constantly updated, use a scheduled job (OS- or MySQL-level) to periodically update the table with the latest statistics. This means the stats will be out of sync for a while, but that may be a reasonable compromise if an acceptable refresh period can be found.
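A minimal sketch using MySQL's event scheduler (event_scheduler must be ON; the event name, the ten-minute period, and the cha_stats table from the previous option are illustrative):
CREATE EVENT refresh_cha_stats
ON SCHEDULE EVERY 10 MINUTE
DO
  REPLACE INTO cha_stats (cha_id, entry_count, total_points, highest)
  SELECT cha_id, COUNT(*), SUM(cha_points), MAX(cha_points)
  FROM playeraccounts_cha_masteries
  GROUP BY cha_id;
(REPLACE only rewrites groups that still exist; rows for deleted cha_id groups would need a separate cleanup.)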
You can even take this approach one step further: instead of generating a separate statistics table, you can generate a static HTML file containing the statistics, with an appropriate expiry set in its headers. This way the server only has to serve a static file for the statistics.
Related
We recently moved our database from MariaDB to AWS Amazon Aurora RDS (MySQL). We observed something strange in a set of queries. We have two queries that are very quick, but when together as nested subquery it takes ages to finish.
Here id is the primary key of the table
SELECT * FROM users where id in(SELECT max(id) FROM users where id = 1);
execution time is ~350ms
SELECT * FROM users where id in(SELECT id FROM users where id = 1);
execution time is ~130ms
SELECT max(id) FROM users where id = 1;
execution time is ~130ms
SELECT id FROM users where id = 1;
execution time is ~130ms
We believe it has something to do with the type of value returned by max() causing the index to be ignored when the outer query runs against the results of the subquery.
All the above queries are simplified for illustration of the problem. The original queries have more clauses as well as 100s of millions of rows. The issue did not exist prior to the migration and worked fine in MariaDB.
--- RESULTS FROM MariaDB ---
MySQL seems to optimize less efficiently compared to MariaDB (in this case).
When doing this in MySQL (see: DBFIDDLE1), the execution plans look like:
For the query without MAX:
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE integers null const PRIMARY PRIMARY 4 const 1 100.00 Using index
1 SIMPLE integers null const PRIMARY PRIMARY 4 const 1 100.00 Using index
For the query with MAX:
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 PRIMARY integers null index null PRIMARY 4 null 1000 100.00 Using where; Using index
2 DEPENDENT SUBQUERY null null null null null null null null null Select tables optimized away
While MariaDB (see: DBFIDDLE2) does have a better-looking plan when using MAX:
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <subquery2> system null null null null 1 100.00
1 PRIMARY integers const PRIMARY PRIMARY 4 const 1 100.00 Using index
2 MATERIALIZED null null null null null null null null Select tables optimized away
EDIT: Because of time (or some lack of it 😉), I am only now adding some info.
A suggestion to fix this:
SELECT *
FROM integers
WHERE i IN (select * from (SELECT MAX(i) FROM integers WHERE i=1)x);
Looking at the execution plan from MariaDB, which has one extra step, I tried to do the same in MySQL. The above query has an even bigger execution plan, but tests show that it performs better. (For the explain plans, see: DBFIDDLE1a.)
"the question is Mariadb that much faster? it uses a step more that mysql"
One step more does not mean that things get slower.
MySQL takes about 2-3 seconds on the query using MAX, while MariaDB executes the same query in under 10 msec. But this is performance, and timings may vary on different systems.
SELECT max(id) FROM users where id = 1
Is strange. Since it is looking only at rows where id = 1, the "max" is obviously 1. So is the min. And the average.
Perhaps you wanted:
SELECT max(id) FROM users
Is there an index on id? Perhaps the PRIMARY KEY? If not, then that might explain the sluggishness.
This can be done much faster (again assuming an index):
SELECT * FROM users
ORDER BY id DESC
LIMIT 1
Does that give you what you want?
To discuss this further, please provide SHOW CREATE TABLE users
I have a simple table:
CREATE TABLE `user_values` (
`id` bigint NOT NULL AUTO_INCREMENT,
`user_id` bigint NOT NULL,
`value` varchar(100) NOT NULL,
PRIMARY KEY (`id`),
KEY `user_id` (`user_id`,`id`),
KEY `id` (`id`,`user_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
that I am trying to execute the following simple query:
select * from user_values where user_id in (20020, 20030) order by id desc;
I would fully expect this query to 100% use an index (either the (user_id, id) one or the (id, user_id) one). Yet it turns out that's not the case:
explain select * from user_values where user_id in (20020, 20030); yields:
id select_type table partitions type key key_len ref rows filtered Extra
1 SIMPLE user_values NULL range user_id 8 NULL 9 100.00 Using index condition; Using filesort
Why is that the case? How can I avoid a filesort on this trivial query?
You can't avoid the filesort in the query you show.
When you use a range predicate (for example, IN ( ) is a range predicate), and an index is used, the rows are read in index order. But there's no way for the MySQL query optimizer to guess that reading the rows in index order by user_id will guarantee they are also in id order. The two user_id values you are searching for are potentially scattered all over the table, in any order. Therefore MySQL must assume that once the matching rows are read, an extra step of sorting the result by id is necessary.
Here's an example of hypothetical data in which reading the rows by an index on user_id will not be in id order.
id user_id
1 20030
2 20020
3 20016
4 20030
5 20020
So when reading from an index on (user_id, id), the matching rows will be returned in the following order, sorted by user_id first, then by id:
id user_id
2 20020
5 20020
1 20030
4 20030
Clearly, the result is not in id order, so it needs to be sorted to satisfy the ORDER BY you requested.
The same kind of effect happens for other types of predicates, for example BETWEEN, or < or != or IS NOT NULL, etc. Every predicate except = is a range predicate.
The only ways to avoid the filesort are to change the query in one of the following ways:
Omit the ORDER BY clause and accept the results in whatever order the optimizer chooses to return them, which could be in id order, but only by coincidence.
Change user_id IN (20020, 20030) to user_id = 20020, so there is only one matching user_id; reading the matching rows from the index then returns them already in id order, making the ORDER BY a no-op. The optimizer recognizes when this is possible, and skips the filesort.
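For illustration, a minimal sketch of the second option:
SELECT * FROM user_values WHERE user_id = 20020 ORDER BY id DESC;
With the (user_id, id) index, EXPLAIN should no longer show "Using filesort" for this query, since the rows come out of the index pre-sorted.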
MySQL will most likely use the index for the query (unless the user_ids in the query cover most of the rows).
The "filesort" happens in memory (it is not really a filesort), and is used to sort the found rows based on the ORDER BY clause.
You cannot avoid a "sort" in this case.
There were about 9 rows to sort, so it could not have taken long.
How long did the query take? Probably only a few milliseconds, so who cares?
"Filesort" does not necessarily mean that a "file" was involved. In many queries the sort is done in RAM.
Do you use id for anything other than to have a PRIMARY KEY on the table? If not, then this will help a small amount. (The speed-up won't be indicated in EXPLAIN.)
PRIMARY KEY (`user_id`,`id`), -- to avoid secondary lookups
KEY `id` (`id`) -- to keep auto_increment happy
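Assembled into the full table definition, that suggestion would look like this (a sketch; InnoDB only requires the auto_increment column to lead some index, which KEY `id` satisfies):
CREATE TABLE `user_values` (
  `id` bigint NOT NULL AUTO_INCREMENT,
  `user_id` bigint NOT NULL,
  `value` varchar(100) NOT NULL,
  PRIMARY KEY (`user_id`,`id`),
  KEY `id` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;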
I have a table which does not have a primary key (nor even a composite key).
The table is for storing the time slots (opening hours and food delivery available hours) of the food shops. Let's call the table "business_hours" and the main fields are as below.
shop_id
day (0 - 6, means Sunday - Saturday)
type (open, delivery)
start_time
end_time
As an example, if shop A is open on Monday from 9.00am - 01.00pm and 05.00pm - 10.00pm, there will be two records in the business_hours table for this scenario.
-----------------------------------------------
| shop_id | day | type | start_time | end_time
-----------------------------------------------
| 1000 | 1 | open | 09:00:00 | 13:00:00
-----------------------------------------------
| 1000 | 1 | open | 17:00:00 | 22:00:00
-----------------------------------------------
When I query this table, I will always use shop_id as the first condition in the where clause.
Ex:
SELECT COUNT(*) FROM business_hours WHERE shop_id = 1000 AND day = 1 AND type = 'open' AND start_time <= '13.29.00' AND end_time > '13.29.00';
Question
Is an index on "shop_id" enough, or should the "day" & "type" fields also be indexed?
It would also be great if you could explain how the indexing really works.
It depends on several factors that you should specify:
How fast will the data grow
What is the estimated table size in rows
What queries will be run against that table
How fast do you expect the queries to run
It is more about thinking like: some service will make thousands of inserts of new records per hour, the old records will be archived nightly, and reports are to be created nightly from that table. In such a case you may prefer not to create many indexes, since they slow down inserts.
On the other hand if your table will grow and change slowly and many users will run queries against it, you need to have proper indexes to speed up queries.
If you can, try to create a clustered unique primary key that most queries can benefit from. If you have data that forms a timeline and most queries will get ranges of data using datetime criteria (like from - to), it is better to include the datetime in the clustered index - you will get the fastest query performance.
So something like this will give you the best performance for the mentioned select. (But you cannot store duplicate business hours for one shop and type.)
CREATE TABLE Business_hours
( shop_id INT NOT NULL
, day INT NOT NULL
-- other columns
, CONSTRAINT Business_hours_PK
PRIMARY KEY (shop_id, day, type, start_time, end_time) -- your clustered index
)
Just creating an index on the fields used in the SELECT (all of them, or just the most used ones) will speed up your query too:
CREATE INDEX BusinessHours_IX ON business_hours (shop_id,day,type, start_time, end_time);
The difference between clustered and non-clustered indexes is that a clustered index determines the physical order in which the records are stored on disk.
You can use EXPLAIN to find missing indexes in your database; see this answer.
For more detail, see this blog.
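For example, a quick check against the asker's query (a sketch; the output columns to watch are key and Extra):
EXPLAIN SELECT COUNT(*) FROM business_hours
WHERE shop_id = 1000 AND day = 1 AND type = 'open'
AND start_time <= '13.29.00' AND end_time > '13.29.00';
If key shows the composite index and Extra shows "Using index", the query is being answered from the index alone.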
Yes, you can create a clustered index on these columns (shop_id, day, type). I have created an index like this:
Create clustered index Ix on business_hours (shop_id, day, type)
Then use this index in your select query, like so:
SELECT COUNT(*) FROM business_hours with (index (Ix)) WHERE shop_id = 1000 AND day = 1 AND type = 'open' AND start_time <= '13.29.00' AND end_time > '13.29.00';
You will get the result fast. But if the table already has a primary key, do not create a clustered index; create a non-clustered index instead.
It also depends on your usage: if you are not updating the records, then use a clustered index on
CREATE CLUSTERED INDEX Saleperday ON business_hours (shop_id, day, type);
because a clustered index stores the entire row on the B-tree node itself, so searching is fast. Updating records, however, is costly, as it shifts the entire row and creates a new entry for the same record.
Otherwise, if you are updating the records, use a non-clustered index.
If you are building a warehouse, use columnstore indexes.
For a better understanding you can go to these links:
http://www.programmerinterview.com/index.php/database-sql/clustered-vs-non-clustered-index/
http://www.patrickkeisler.com/2014/04/what-is-non-clustered-columnstore-index.html
http://searchsqlserver.techtarget.com/feature/SQL-Server-2014-columnstore-index-the-good-the-bad-and-the-clustered
Having decided against a primary key means the following would be allowed:
| shop_id | day | type | start_time | end_time
+---------+-----+--------+------------+---------
| 1000 | 1 | open | 09:00:00 | 13:00:00
| 1000 | 1 | open | 09:00:00 | 13:00:00
| 1000 | 1 | open | 17:00:00 | 22:00:00
| 1000 | 1 | closed | 17:00:00 | 22:00:00
So you can have duplicate entries that may lead to strange query results and even have a shop open and closed in the very same time range. (But well, we all know that even with a primary key you'd still need a before-insert trigger to detect a range overlapping, e.g. 12:00-15:00 vs. 13:00-16:00, and throw an error in case. - How I wish there were some built-in range detection, so we could, say, have a unique index on (shop_id, day, range(start_time, end_time)).)
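For what it's worth, a hedged sketch of such a before-insert trigger (names are illustrative; an equivalent before-update trigger would be needed as well):
DELIMITER //
CREATE TRIGGER business_hours_bi
BEFORE INSERT ON business_hours
FOR EACH ROW
BEGIN
  -- two ranges overlap when each starts before the other ends
  IF EXISTS (SELECT 1
             FROM business_hours
             WHERE shop_id = NEW.shop_id
               AND day = NEW.day
               AND type = NEW.type
               AND start_time < NEW.end_time
               AND end_time > NEW.start_time) THEN
    SIGNAL SQLSTATE '45000'
      SET MESSAGE_TEXT = 'Overlapping time range for this shop, day, and type';
  END IF;
END//
DELIMITER ;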
As to your question: Provided your database is built well, you already have a foreign key on shop_id. You don't need any further index as long as you consider your queries fast enough.
Once you think you need to speed them up, you can add composite indexes as needed. That would usually be an index on all columns in the slow query's WHERE clause. If that still doesn't suffice, add the columns that are in the GROUP BY clause, if any. The next step would be to add the columns of the HAVING clause, if any. Next would be the columns of the ORDER BY clause. And the last step would be to add even all columns in your SELECT clause, which would give you a covering index, i.e. all data needed for the query would be in the index, and the table itself would hence not have to be accessed any longer.
But as mentioned: As long as you don't have performance issues, you don't have to add any composite indexes.
To decide which fields must be indexed in a database table, you need to observe the behavior of each query sent to the table. Indexes are the means of providing an efficient access path between the application and the data. The index provides the access path, so when a query asks the database for data, it will know where to go to retrieve it.
Here is some official Microsoft documentation
Clustered Indexes
A clustered index stores the actual table data pages at the leaf level, and the table data is ordered physically around the key. A table can have only one clustered index, and when this index is created, the following events also occur:
• Table data is rearranged.
• New index pages are created.
• All nonclustered indexes within the database are rebuilt.
As a result, there are many disk I/O operations and extensive use of system and memory resources. If you plan to create a clustered index, be sure you have free space equal to at least 1.5 times the amount of data in the table. The extra free space ensures that you have enough space to complete the operation efficiently.
Nonclustered Indexes
In a nonclustered index, pages at the leaf level contain a bookmark that tells SQL Server where to find the data row corresponding to the key in the index. If the table has a clustered index, the bookmark indicates the clustered index key. If the table does not have a clustered index, the bookmark is an actual row locator. When you create a nonclustered index, SQL Server creates the required index pages but does not rearrange table data.
The indexing method recommended by professionals comprises three phases: monitor, analyze, then implement the index. That means you need to observe the behavior of your database when you run a query, and then work toward the best performance.
SQL Server uses these operations to fetch data:
Table scan: reads the entire heap and, most likely, passes all the data to a secondary filter operation.
Index scan: reads the entire leaf level (every row) of the clustered index or non-clustered index. The index scan operation might filter the rows and return only those rows that meet the criteria, or it might pass all the rows to another filter operation, depending on the complexity of the criteria. The data may or may not be ordered.
Index seek: locates specific row(s) using the index and returns only the selected rows in an ordered list.
So, once you know this, you can run the query with the Display the Estimated Execution Plan option and analyze the performance.
I recommend reading the posts SQL SERVER – Index Seek Vs. Index Scan and Optimizing Your Query Plans with the SQL
here is the "explain" of my query:
explain
select eil.sell_fmt, count(sell_fmt) as itemCount
from table_items eil
where eil.cl_Id=123 and eil.si_Id='0'
and start_date <= now() and end_date is not null and end_date < NOW()
group by eil.sell_fmt
without date (start_date, end_date) filters:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE eil ref table_items_clid_siid_sellFmt 39 const,const 7393 Using where; Using index
With date filters:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE eil ref table_items_clid_siid_sellFmt 39 const,const 8400 Using where
possible_keys are:
table_items_clid_siid, table_items_clid_siid_itemId, table_items_clid_siid_startDate_endDate, table_items_clid_siid_sellFmt
The query without date filters is very fast (0.4 sec), but with date filters it takes about 30 seconds. The total records are only 14K.
Table field types:
`cl_Id` int(11) NOT NULL,
`si_Id` varchar(11) NOT NULL,
`start_date` datetime DEFAULT NULL,
`end_date` datetime DEFAULT NULL,
`sell_fmt` varchar(20) DEFAULT NULL
I concatenated the field names to form the index names, so you can tell which fields each index combines.
Can somebody guide me here? What's going on, what is the best course of action to take, and where am I going wrong?
I need one more suggestion: in another query on the same table, a user can filter based on UP TO 10 fields, in no definite order (a random number of fields in random order). That type of search would be too slow again. What's the best strategy then? One covering index with "all" possible searchable fields? If yes, does the order of fields in the index matter (i.e. if that order is different from the order of fields in the query, will the index still be used)?
First, without seeing your CREATE TABLE statement, I can offer the following: create a composite index (multiple fields) that best matches your common query elements in the WHERE clause, starting with the smallest nominal count basis, since you are explicitly looking for a "cl_Id" and "si_Id" plus start and end dates. Since you have a GROUP BY, I would add that column to the index for optimization purposes, making it a completely COVERING index so the engine does not need to go back to the raw data to complete the query; it can resolve everything from the fields in the index directly.
I would have an index on
( cl_id, si_id, start_date, end_date, sell_fmt )
Finally, change your count from count(sell_fmt) to just count(*), indicating "I don't care about a specific field; as long as a record is found, count it".
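Put together, a hedged sketch (the index name is illustrative):
ALTER TABLE table_items
  ADD INDEX idx_items_covering (cl_Id, si_Id, start_date, end_date, sell_fmt);
select eil.sell_fmt, count(*) as itemCount
from table_items eil
where eil.cl_Id=123 and eil.si_Id='0'
and start_date <= now() and end_date is not null and end_date < NOW()
group by eil.sell_fmt
With every referenced column in the index, EXPLAIN should show "Using index" in Extra for the date-filtered query as well.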
I have a very simple MySQL table with 5 columns but 5 million rows. Earlier, when there was less data, my server load was very low, but the load is increasing now that the data has passed 5 million rows, and I expect it to reach 10 million by year's end, so the server will be even slower. I have used indexes wisely.
The structure is very simple, with id as an auto-increment primary key, and I am filtering the data based on id only, which is automatically indexed since it is the primary key (I tried indexing it separately as well, but with no benefit).
table A
id pid title app get
my query is
EXPLAIN SELECT * FROM tableA ORDER BY id DESC LIMIT 4061280 , 10
and explain says
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE tableA ALL NULL NULL NULL NULL 4700461 Using filesort
I don't want to go through all the rows, as that slows down my server and creates a heavy load for the file sort, which builds temporary files either in the buffer or on disk.
Please advise any good idea to solve this issue.
When my id is indexed, why does it go through all the rows to reach the desired row? Can't it jump directly to that row?
Assuming you don't have "gaps" (read: deleted records) in your id..
SELECT * FROM tableA WHERE id > 4061279 and id <= 4061290 ORDER BY id DESC
OK, next:
SELECT * FROM tableA WHERE id <= 4061290 ORDER BY id DESC LIMIT 10
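More generally, a hedged sketch of keyset ("remember where you left off") pagination; :last_seen_id is a placeholder for the lowest id returned by the previous page:
SELECT * FROM tableA
WHERE id < :last_seen_id
ORDER BY id DESC
LIMIT 10;
This lets the server descend the PRIMARY KEY straight to the starting row instead of counting off millions of rows the way a big OFFSET does.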