SELECT COUNT(*) with GROUP BY slow for large table

SELECT COUNT(*) with GROUP BY slow for large table - mysql

I have a table with around 100 million rows consisting of three columns (all INT):
id | c_id | l_id
Even though I use indices even a basic
select count(*), c_id
from table
group by c_id;
takes 16 seconds (MYISAM) to 25 seconds (InnoDB) to complete.
Is there any way to speed up this process without tracking the count in a seperate table (e.g. by using triggers)?
/edit: all colums have indices

See execution plan for possible ways to do the same queries SqlFiddle,
SELECT COUNT(id) will be faster if c_id is not indexed on the test set i have provided.
otherwise you should use COUNT(*) since optimization of index may not be used in the query.
It is also dependent of the number of rows in the DB and the ENGINE type , since mysql will decide what is better based on this fact also.
You should always see the execution plan of the query before executing it by typing EXPLAIN before the select.
I have to say that in most cases on big datasets, COUNT(*) and COUNT(id) should result in the same execution plan.

It's not the Count(*) that gives the performance issue but grouping on 100 million rows.
You should add an index on the c_id column

Related

SQl query optimization in workbench

Running a query in SQL takes a lot of time.
There are 240000000 total rows and 7700000 unique rows.
Try to calculate the average daily step count of the user between 3000 and 4000.
select count(distinct user_id) from (SELECT user_id,ROUND(AVG(IF(steps>'0',steps,NULL)),0) AS `Average Steps`
FROM `step_activity`.`step_activities` where user_id between '1100001' and '9999999' group by user_id
having `Average Steps` between '3000' and '4000') as custlt3k;
just want to know the total number of users.###

The GROUP BY in the derived table (inner query) makes the user_id distinct. Hence, the DISTINCT is not needed. CHange to simply COUNT(*).
Please provide SHOW CREATE TABLE so we can see if you have an index starting with user_id, which might be beneficial. Also how many rows in step_activities and how many rows with user_id between '1100001' AND '9999999'. Comparing those will determine whether the index will even be used.
The task requires all the rows, at least the rows with that range of users, to be read, and "grouped".
This index may help: INDEX(user_id, steps) because that would be a "covering" index.
Another thing to consider -- don't store any rows with steps = 0. After all, they are being thrown out in this query. (Maybe there are other columns in the row that you need to keep?)

SQL gets slow on a simple query with ORDER BY

I have problem with MySQL ORDER BY, it slows down query and I really don't know why, my query was a little more complex so I simplified it to a light query with no joins, but it stills works really slow.
Query:
SELECT
W.`oid`
FROM
`z_web_dok` AS W
WHERE
W.`sent_eRacun` = 1 AND W.`status` IN(8, 9) AND W.`Drzava` = 'BiH'
ORDER BY W.`oid` ASC
LIMIT 0, 10
The table has 946,566 rows, with memory taking 500 MB, those fields I selecting are all indexed as follow:
oid - INT PRIMARY KEY AUTOINCREMENT
status - INT INDEXED
sent_eRacun - TINYINT INDEXED
Drzava - VARCHAR(3) INDEXED
I am posting screenshoots of explain query first:
The next is the query executed to database:
And this is speed after I remove ORDER BY.
I have also tried sorting with DATETIME field which is also indexed, but I get same slow query as with ordering with primary key, this started from today, usually it was fast and light always.
What can cause something like this?

The kind of query you use here calls for a composite covering index. This one should handle your query very well.
CREATE INDEX someName ON z_web_dok (Drzava, sent_eRacun, status, oid);
Why does this work? You're looking for equality matches on the first three columns, and sorting on the fourth column. The query planner will use this index to satisfy the entire query. It can random-access the index to find the first row matching your query, then scan through the index in order to get the rows it needs.
Pro tip: Indexes on single columns are generally harmful to performance unless they happen to match the requirements of particular queries in your application, or are used for primary or foreign keys. You generally choose your indexes to match your most active, or your slowest, queries. Edit You asked whether it's better to create specific indexes for each query in your application. The answer is yes.

There may be an even faster way. (Or it may not be any faster.)
The IN(8, 9) gets in the way of easily handling the WHERE..ORDER BY..LIMIT completely efficiently. The possible solution is to treat that as OR, then convert to UNION and do some tricks with the LIMIT, especially if you might also be using OFFSET.
( SELECT ... WHERE .. = 8 AND ... ORDER BY oid LIMIT 10 )
UNION ALL
( SELECT ... WHERE .. = 9 AND ... ORDER BY oid LIMIT 10 )
ORDER BY oid LIMIT 10
This will allow the covering index described by OJones to be fully used in each of the subqueries. Furthermore, each will provide up to 10 rows without any temp table or filesort. Then the outer part will sort up to 20 rows and deliver the 'correct' 10.
For OFFSET, see http://mysql.rjweb.org/doc.php/index_cookbook_mysql#or

SQL: Optimize the query on large table with indexing

For example, I have the following table:
table Product
------------
id
category_id
processed
product_name
This table has index on columns id category_id and processed and (category_id, proccessed). The statistic on this table is:
select count(*) from Product; -- 50M records
select count(*) from Product where category_id=10; -- 1M records
select count(*) from Product where processed=1; -- 30M records
My simplest query I want to query is: (select * is the must).
select * from Product
where category_id=10 and processed=1
order by id ASC LIMIT 100
The above query without limit only has about 10,000 records.
I want to call the above query for multiple time. Every time I get out I will update field processed to 0. (so it will not appear on the next query). When I test on the real data, sometime the optimizer try to use id as the key, so it cost a lot of time.
How can I optimize the above query (In general term)
P/S: for avoiding confuse, I know that the best index should be (category, processed, id). But I cannot change the index. My question is just only related to optimize the query.
Thanks

For this query:
select *
from Product
where category_id = 10 and processed = 1
order by id asc
limit 100;
The optimal index is on product(category_id, processed, id). This is a single index with a three-part key, with the keys in this order.

Given that you have INDEX(category_id, processed), there is virtually no advantage in also having just INDEX(category_id). So DROP the latter.
That may have the beneficial side effect of pushing the Optimizer toward the composite INDEX(category_id, processed), which is at least "better" for the query.
Without touching the indexes, you could use a FORCE INDEX mentioning the composite index's name. But I don't recommend it. "It may help today, but hurt tomorrow, after the data changes."
Why do you say "But I cannot change the index."? Newer version of MySQL/MariaDB make ADD/DROP INDEX much faster than older versions. Also, pt-online-schema-change is provides a fast way.

Mysql query group by 2 million records takes 7 seconds

I had a query with group by condition takes 7 seconds. I had optimized all its data type and also given index.
Before index Execution time: 15 sec
After index and optimize data type its taken time: 7 seconds.
My database is much larger and I have to run approx 15 queries to prepare my report
the single query takes 7 sec then 15*7 = 105 sec it's too much time-consuming.
Sample Query:
Select
ca_id,
count(ca_id),
email,
cadate
from
tbl_zonec
group by
email,
cadate,
ca_id
Is there any way to optimize the query performance?

Get rid of id; it is just plain wrong not to have it in the GROUP BY. Ditto for sample.
If ca_id cannot be NULL, say COUNT(*).
You seem to be scanning the entire table every time. Do you get different results? Is the table being added to? Is old data being modified or deleted, or is this effectively a write-once table? If the latter, then consider building and maintaining a Summary table; it will be a lot faster. (Perhaps 10-fold.)
Please provide SHOW CREATE TABLE. What is the PRIMARY KEY? If it is does not start with email, cadate, ca_id, then there is an improvement possible.
Without changing the PK, this would run a little faster: INDEX(email, cadate, ca_id, sample, id) -- it is optimal for the GROUP BY and it is "covering".

Should I COUNT(*) or not?

I know it's generally a bad idea to do queries like this:
SELECT * FROM `group_relations`
But when I just want the count, should I go for this query since that allows the table to change but still yields the same results.
SELECT COUNT(*) FROM `group_relations`
Or the more specfic
SELECT COUNT(`group_id`) FROM `group_relations`
I have a feeling the latter could potentially be faster, but are there any other things to consider?
Update: I am using InnoDB in this case, sorry for not being more specific.

If the column in question is NOT NULL, both of your queries are equivalent. When group_id contains null values,
select count(*)
will count all rows, whereas
select count(group_id)
will only count the rows where group_id is not null.
Also, some database systems, like MySQL employ an optimization when you ask for count(*) which makes such queries a bit faster than the specific one.
Personally, when just counting, I'm doing count(*) to be on the safe side with the nulls.

If I remember it right, in MYSQL COUNT(*) counts all rows, whereas COUNT(column_name) counts only the rows that have a non-NULL value in the given column.

COUNT(*) count all rows while COUNT(column_name) will count only rows without NULL values in the specified column.
Important to note in MySQL:
COUNT() is very fast on MyISAM tables for * or not-null columns, since the row count is cached. InnoDB has no row count caching, so there is no difference in performance for COUNT(*) or COUNT(column_name), regardless if the column can be null or not. You can read more on the differences on this post at the MySQL performance blog.

if you try SELECT COUNT(1) FROMgroup_relations it will be a bit faster because it will not try to retrieve information from your columns.
Edit: I just did some research and found out that this only happens in some db. In sqlserver it's the same to use 1 or *, but on oracle it's faster to use 1.
http://social.msdn.microsoft.com/forums/en-US/transactsql/thread/9367c580-087a-4fc1-bf88-91a51a4ee018/
Apparently there is no difference between them in mysql, like sqlserver the parser appears to change the query to select(1). Sorry if I mislead you in some way.

I was curious about this myself. It's all fine to read documentation and theoretical answers, but I like to balance those with empirical evidence.
I have a MySQL table (InnoDB) that has 5,607,997 records in it. The table is in my own private sandbox, so I know the contents are static and nobody else is using the server. I think this effectively removes all outside affects on performance. I have a table with an auto_increment Primary Key field (Id) that I know will never be null that I will use for my where clause test (WHERE Id IS NOT NULL).
The only other possible glitch I see in running tests is the cache. The first time a query is run will always be slower than subsequent queries that use the same indexes. I'll refer to that below as the cache Seeding call. Just to mix it up a little I ran it with a where clause I know will always evaluate to true regardless of any data (TRUE = TRUE).
That said here are my results:
QueryType
| w/o WHERE | where id is not null | where true=true
COUNT()
| 9 min 30.13 sec ++ | 6 min 16.68 sec ++ | 2 min 21.80 sec ++
| 6 min 13.34 sec | 1 min 36.02 sec | 2 min 0.11 sec
| 6 min 10.06 se | 1 min 33.47 sec | 1 min 50.54 sec
COUNT(Id)
| 5 min 59.87 sec | 1 min 34.47 sec | 2 min 3.96 sec
| 5 min 44.95 sec | 1 min 13.09 sec | 2 min 6.48 sec
COUNT(1)
| 6 min 49.64 sec | 2 min 0.80 sec | 2 min 11.64 sec
| 6 min 31.64 sec | 1 min 41.19 sec | 1 min 43.51 sec
++This is considered the cache Seeding call. It is expected to be slower than the rest.
I'd say the results speak for themselves. COUNT(Id) usually edges out the others. Adding a Where clause dramatically decreases the access time even if it's a clause you know will evaluate to true. The sweet spot appears to be COUNT(Id)... WHERE Id IS NOT NULL.
I would love to see other peoples' results, perhaps with smaller tables or with where clauses against different fields than the field you're counting. I'm sure there are other variations I haven't taken into account.

Seek Alternatives
As you've seen, when tables grow large, COUNT queries get slow. I think the most important thing is to consider the nature of the problem you're trying to solve. For example, many developers use COUNT queries when generating pagination for large sets of records in order to determine the total number of pages in the result set.
Knowing that COUNT queries will grow slow, you could consider an alternative way to display pagination controls that simply allows you to side-step the slow query. Google's pagination is an excellent example.
Denormalize
If you absolutely must know the number of records matching a specific count, consider the classic technique of data denormalization. Instead of counting the number of rows at lookup time, consider incrementing a counter on record insertion, and decrementing that counter on record deletion.
If you decide to do this, consider using idempotent, transactional operations to keep those denormalized values in synch.
BEGIN TRANSACTION;
INSERT INTO `group_relations` (`group_id`) VALUES (1);
UPDATE `group_relations_count` SET `count` = `count` + 1;
COMMIT;
Alternatively, you could use database triggers if your RDBMS supports them.
Depending on your architecture, it might make sense to use a caching layer like memcached to store, increment and decrement the denormalized value, and simply fall through to the slow COUNT query when the cache key is missing. This can reduce overall write-contention if you have very volatile data, though in cases like this, you'll want to consider solutions to the dog-pile effect.

MySQL ISAM tables should have optimisation for COUNT(*), skipping full table scan.

An asterisk in COUNT has no bearing with asterisk for selecting all fields of table. It's pure rubbish to say that COUNT(*) is slower than COUNT(field)
I intuit that select COUNT(*) is faster than select COUNT(field). If the RDBMS detected that you specify "*" on COUNT instead of field, it doesn't need to evaluate anything to increment count. Whereas if you specify field on COUNT, the RDBMS will always evaluate if your field is null or not to count it.
But if your field is nullable, specify the field in COUNT.

COUNT(*) facts and myths:
MYTH: "InnoDB doesn't handle count(*) queries well":
Most count(*) queries are executed same way by all storage engines if you have a WHERE clause, otherwise you InnoDB will have to perform a full table scan.
FACT: InnoDB doesn't optimize count(*) queries without the where clause

It is best to count by an indexed column such as a primary key.
SELECT COUNT(`group_id`) FROM `group_relations`

It should depend on what you are actually trying to achieve as Sebastian has already said, i.e. make your intentions clear! If you are just counting the rows then go for the COUNT(*), or counting a single column go for the COUNT(column).
It might be worth checking out your DB vendor too. Back when I used to use Informix it had an optimisation for COUNT(*) which had a query plan execution cost of 1 compared to counting single or mutliple columns which would result in a higher figure

if you try SELECT COUNT(1) FROM group_relations it will be a bit faster because it will not try to retrieve information from your columns.
COUNT(1) used to be faster than COUNT(*), but that's not true anymore, since modern DBMS are smart enough to know that you don't wanna know about columns

The advice I got from MySQL about things like this is that, in general, trying to optimize a query based on tricks like this can be a curse in the long run. There are examples over MySQL's history where somebody's high-performance technique that relies on how the optimizer works ends up being the bottleneck in the next release.
Write the query that answers the question you're asking -- if you want a count of all rows, use COUNT(*). If you want a count of non-null columns, use COUNT(col) WHERE col IS NOT NULL. Index appropriately, and leave the optimization to the optimizer. Trying to make your own query-level optimizations can sometimes make the built-in optimizer less effective.
That said, there are things you can do in a query to make it easier for the optimizer to speed it up, but I don't believe COUNT is one of them.
Edit: The statistics in the answer above are interesting, though. I'm not sure whether there is actually something at work in the optimizer in this case. I'm just talking about query-level optimizations in general.

I know it's generally a bad idea to do
queries like this:
SELECT * FROM `group_relations`
But when I just want the count, should
I go for this query since that allows
the table to change but still yields
the same results.
SELECT COUNT(*) FROM `group_relations`
As your question implies, the reason SELECT * is ill-advised is that changes to the table could require changes in your code. That doesn't apply to COUNT(*). It's pretty rare to want the specialized behavior that SELECT COUNT('group_id') gives you - typically you want to know the number of records. That's what COUNT(*) is for, so use it.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008