Counting Records: COUNT or SUM or ROW? - mysql

This table contains server monitoring records. Whenever a server fails to ping, a new record is inserted, so one server can fail multiple times. I want to get the count of how many times server 3 has failed.
This is the table where failure_id is Primary Key.
failure_id  server_id  protocol  added_date
----------  ---------  --------  -------------------
1           1          HTTP      2013-02-04 15:50:42
2           3          HTTP      2013-02-04 16:35:20
Using (*) to count the rows
SELECT COUNT(*) AS `total`
FROM `failures` `f`
WHERE CAST(`f`.`server_id` AS CHAR) = 3;
Using server_id to count the rows
SELECT COUNT(`f`.`server_id`) AS `total`
FROM `failures` `f`
WHERE CAST(`f`.`server_id` AS CHAR) = 3;
Using SUM to count the rows
SELECT IFNULL(SUM(1), 0) AS `total`
FROM `failures` `f`
WHERE CAST(`f`.`server_id` AS CHAR) = 3;
All the above queries return the correct output. But my database will be very large in the future. Which method is best to use based on performance? Thanks in advance...

I'd say none of the above, if you have control over the app that's inserting the records. If you don't have a table for your servers, create one; otherwise add a field called current_failure_count or something and put it in that table. Then, when you insert a failure record, also update the server table with current_failure_count = current_failure_count + 1 for that server. That way you only have to read one record in the server table (indexed by server_id, I'd assume) and you're set. No, this does not follow the normalization rules, but you are seeking speed, and this is the best way to get it if you can control the client software.
If you cannot control the client software, perhaps you can put a trigger on inserts into the failures table that increments the current_failure_count value in the servers table. That should work as well.
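For concreteness, here is a minimal sketch of such a trigger, assuming a servers table with a current_failure_count column (both names are illustrative, not from the question):
CREATE TRIGGER failures_after_insert
AFTER INSERT ON failures
FOR EACH ROW
  UPDATE servers
  SET current_failure_count = current_failure_count + 1
  WHERE server_id = NEW.server_id;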

Well, the second is definitely more efficient than the first.
I recommend you create a view for the server, which should significantly speed things up:
CREATE VIEW server3 AS
SELECT server_id
FROM failures
WHERE server_id = 3;
Then simply run a COUNT on that view as if it were a table!

Like others, I'm not clear on why you're casting the server_id value; that cast is going to cost you more performance than any other issue here.
If you can eliminate that cast so that you're searching WHERE server_id = (value) and you create an index on server_id then either of the first two queries you suggested will be able to perform index-only retrieval and will provide optimal performance.
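A minimal sketch of that index (the index name is arbitrary):
ALTER TABLE failures ADD INDEX idx_server_id (server_id);
With it in place, the WHERE server_id = 3 predicate can be resolved entirely from the index.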

SELECT COUNT(*) AS `total`
FROM failures f
WHERE f.server_id = 3;
COUNT(*) will always be better than the arithmetic calculation, and applying an index on server_id will give faster results still.
The second-best solution would be
SELECT IFNULL(SUM(1), 0) AS `total`
FROM failures `f`
WHERE f.server_id = 3;
This method is used by the SQL engines of many tools, such as MicroStrategy.
Hope this answer helps. :)

Related

MySQL query takes more time to fetch data [MySQL]

I have a table with 500,000 records on my MySQL server. When running a query, execution takes a long time, sometimes beyond a minute.
Below I have added my MySQL machine details.
RAM: 16 GB
Processor: Intel(R) Core™ i5-4460M CPU @ 3.20GHz
OS: Windows Server, 64-bit
I know there is no problem with my machine, since it is a standalone machine and no other applications run on it.
Maybe the problem is with my query. I have gone through the MySQL site and found that I have used proper syntax, but I don't know the exact reason for the delay in the result.
SELECT SUM(`samplesalesdata50000`.`UnitPrice`) AS `UnitPrice`, `samplesalesdata50000`.`SalesChannel` AS `Channel`
FROM `samplesalesdata50000` AS `samplesalesdata50000`
GROUP BY `samplesalesdata50000`.`SalesChannel`
ORDER BY 2 ASC
LIMIT 200 OFFSET 0
Can anyone please let me know whether the duration, depends on the table or the query that I have used?
Note: Even if I try with indexing, there is not much difference in result time.
Thanks
Two approaches to this:
One approach is to create a covering index on the columns needed to satisfy your query. The correct index for your query contains these columns in this order: (SalesChannel, UnitPrice).
Why does this help? For one thing, the index itself contains all data needed to satisfy your query, and nothing else. This means your server does less work.
For another thing, MySQL's indexes are BTREE-organized. That means they're accessible in order. So your query can be satisfied one SalesChannel at a time, and MySQL doesn't need an internal temporary table. That's faster.
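For example, the covering index could be created like this (the index name is illustrative):
ALTER TABLE samplesalesdata50000
  ADD INDEX idx_channel_price (SalesChannel, UnitPrice);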
A second approach involves recognizing that ORDER BY ... LIMIT is a notorious performance antipattern. You require MySQL to sort a big mess of data, and then discard most of it.
You could try this (note that MySQL doesn't allow LIMIT directly inside an IN subquery, so the inner select is wrapped in a derived table, and DISTINCT ensures you get the first 200 distinct channels):
SELECT SUM(UnitPrice) UnitPrice,
       SalesChannel Channel
FROM samplesalesdata50000
WHERE SalesChannel IN (
        SELECT SalesChannel
        FROM (
            SELECT DISTINCT SalesChannel
            FROM samplesalesdata50000
            ORDER BY SalesChannel
            LIMIT 200 OFFSET 0
        ) AS first_channels
      )
GROUP BY SalesChannel
ORDER BY SalesChannel
LIMIT 200 OFFSET 0
If you have an index on SalesChannel (the covering index mentioned above works) this should speed you up a lot, because your aggregate (GROUP BY) query need only consider a subset of your table.
Your problem is with "ORDER BY 2 ASC". Try "ORDER BY Channel" instead.
If it were MS SQL Server you would use WITH (NOLOCK), and the MySQL equivalent is
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED ;
SELECT SUM(`samplesalesdata50000`.`UnitPrice`) AS `UnitPrice`, `samplesalesdata50000`.`SalesChannel` AS `Channel`
FROM `samplesalesdata50000` AS `samplesalesdata50000`
GROUP BY `samplesalesdata50000`.`SalesChannel`
ORDER BY SalesChannel ASC
LIMIT 200 OFFSET 0
COMMIT ;
To improve on OJones's answer, note that
SELECT SalesChannel FROM samplesalesdata50000
ORDER BY SalesChannel LIMIT 200, 1
will quickly (assuming the index given) find the end of the desired list. Then adding this limits the main query to only the rows needed:
WHERE SalesChannel < (that-select)
There is, however, a problem. If there are fewer than 200 rows in the table, the subquery will return nothing.
You seem to be setting up for "paginating"? In that case, a similar technique can be used to find the starting value:
WHERE SalesChannel >= ...
AND SalesChannel < ...
This also avoids using the inefficient OFFSET, which has to read, then toss, all the rows being skipped over.
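A hedged sketch of that boundary technique ('Online' is a placeholder for whatever value the first query actually returns):
SELECT SalesChannel
FROM samplesalesdata50000
ORDER BY SalesChannel
LIMIT 200, 1;
-- suppose that returned 'Online'; then:
SELECT SalesChannel, SUM(UnitPrice) AS UnitPrice
FROM samplesalesdata50000
WHERE SalesChannel < 'Online'
GROUP BY SalesChannel
ORDER BY SalesChannel;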
But the real solution may be to build and maintain a summary table of the data. It would contain subtotals for each, say, month. Then run the query against the summary table; it might be 10x faster.
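A rough sketch of such a summary table (the names are hypothetical, and it assumes the base data has a date column to bucket by):
CREATE TABLE sales_summary (
  yearmonth     CHAR(7)       NOT NULL,  -- e.g. '2013-02'
  SalesChannel  VARCHAR(50)   NOT NULL,
  total_price   DECIMAL(12,2) NOT NULL,
  PRIMARY KEY (yearmonth, SalesChannel)
);
The aggregate query then reads a few hundred summary rows instead of scanning 500,000 detail rows.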

MySQL: How to achieve row-level transaction locking instead of table locking

Here's the use case:
I have a table with a bunch of unique codes which are either available or not. As part of a transaction, I want to select an available code from the table, then update that row later in the same transaction. Since this can happen concurrently across many sessions, I ideally want to select a random record and use row-level locking on the table, so that other transactions aren't blocked by the query that selects a row.
I am using InnoDB for the storage engine, and my query looks something like this:
select * from tbl_codes where available = 1 order by rand() limit 1 for update
However, rather than locking just one row from the table, it ends up locking the whole table. Can anyone give me some pointers on how to make it so that this query doesn't lock the whole table but just the row?
Update
Addendum: I was able to achieve row-level locking by specifying an explicit key in my select rather than doing the rand(). When my queries look like this:
Query 1:
select * from tbl_codes where available = 1 and id=5 limit 1 for update
Query 2:
select * from tbl_codes where available = 1 and id=10 limit 1 for update
However, that doesn't really help solve the problem.
Addendum 2: Final Solution I went with
Given that rand() has some issues in MySQL, the strategy I chose is:
I select 50 code ids where available = 1, then I shuffle the array in the application layer to add a level of randomness to the order.
select id from tbl_codes where available = 1 limit 50
I start popping codes from my shuffled array in a loop until I am able to select one with a lock
select * from tbl_codes where available = 1 and id = :id for update
It may be useful to look at how this query is actually executed by MySQL:
select * from tbl_codes where available = 1 order by rand() limit 1 for update
This will read and sort all rows that match the WHERE condition, generate a random number using rand() into a virtual column for each row, sort all rows (in a temporary table) based on that virtual column, and then return rows to the client from the sorted set until the LIMIT is reached (in this case just one). The FOR UPDATE affects locking done by the entire statement while it is executing, and as such the clause is applied as rows are read within InnoDB, not as they are returned to the client.
Putting aside the obvious performance implications of the above (it's terrible), you're never going to get reasonable locking behavior from it.
Short answer:
Select the row you want, using RAND() or any other strategy you like, in order to find the PRIMARY KEY value of that row. E.g.: SELECT id FROM tbl_codes WHERE available = 1 ORDER BY rand() LIMIT 1
Lock the row you want using its PRIMARY KEY only. E.g.: SELECT * FROM tbl_codes WHERE id = N FOR UPDATE
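Putting the two steps together in one transaction, a minimal sketch (42 stands in for whatever id step 1 returned; re-checking available guards against a concurrent claim):
START TRANSACTION;
-- step 1: pick a candidate without locking anything
SELECT id FROM tbl_codes WHERE available = 1 ORDER BY RAND() LIMIT 1;
-- step 2: lock only that row by primary key (suppose step 1 returned 42)
SELECT * FROM tbl_codes WHERE id = 42 AND available = 1 FOR UPDATE;
-- if the locked row is still available, claim it
UPDATE tbl_codes SET available = 0 WHERE id = 42;
COMMIT;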
Hopefully that helps.
Even if not exactly mapping to your question, the problem is somewhat discussed here: http://akinas.com/pages/en/blog/mysql_random_row/
The problem with this method is that it is very slow. The reason for
it being so slow is that MySQL creates a temporary table with all the
result rows and assigns each one of them a random sorting index. The
results are then sorted and returned.
The article does not deal with locks. However, maybe MySQL locks all the rows having available = 1 and does not release them until the end of the transaction!
That article proposes some solutions, none of which seems good for you, except this one, which is unfortunately very hacky; I haven't verified its correctness.
SELECT * FROM table
WHERE id >= (SELECT FLOOR(MAX(id) * RAND()) FROM table)
ORDER BY id LIMIT 1;
This is the best I can do for you, since I don't know MySQL internals well. Moreover, the article is pretty old.

Which is a less expensive query: count(id) or order by id?

I'd like to know which of the following queries would execute faster in a MySQL database. The table would have 200-1000 entries.
SELECT id
from TABLE
order by id desc
limit 1
or
SELECT count(id)
from TABLE
The story is that the table is cached, and this query is to be executed every time before cache retrieval, to determine whether the cached data is invalid by comparing against the previous value.
So if there is an even less expensive query, please kindly let me know. Thanks.
If you
start from 1
never have any gaps
use the MyISAM engine
id is not nullable
Then the 2nd could run [ever so marginally] faster, due to not having to visit table data at all (the count is stored in table metadata).
Otherwise,
if the table has NO index on ID (causing a SCAN), the 2nd one is faster
Barring both the above
the first one is faster
And if you actually meant to ask SELECT .. LIMIT 1 vs SELECT MAX(id).. then the answer is actually that they are the same for MySQL and most sane DBMS, whether or not there is an index.
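For concreteness, the two forms being compared (using the question's placeholder table name):
SELECT id FROM `TABLE` ORDER BY id DESC LIMIT 1;
SELECT MAX(id) FROM `TABLE`;
With an index on id, both resolve to a single read of the last index entry.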
I think the first query will run faster, as it is limited to one row; with 200-1000 entries, the difference may not matter much in this case.
As already pointed out in the comments, your table is so small that it really doesn't matter what your solution will be. For this reason select count(id) should be used, as it expresses the intent and doesn't need any further processing.
Now select count(id) comes with an alternative, select count(*). These two are not synonyms. select count(*) will count the number of rows and can use a cached value when possible, whereas select count(id) counts the number of non-null values in the column id. If the id column is declared NOT NULL, then the cached row count may be used.
The selection between count(*) and count(id) depends once again on your intent. In the general case, count(*) describes the intent better.
Then there is the possibility of count(1), which is actually a synonym of count(*) in MySQL, but the interpretation may vary if you end up using a different RDBMS.
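A quick way to see the difference for yourself (a throwaway table, purely illustrative):
CREATE TEMPORARY TABLE t (id INT NULL);
INSERT INTO t VALUES (1), (NULL), (3);
SELECT COUNT(*) FROM t;   -- 3: counts rows
SELECT COUNT(id) FROM t;  -- 2: skips the NULL
SELECT COUNT(1) FROM t;   -- 3: a synonym of COUNT(*) in MySQL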
The performance of each type of count also varies depending on whether you are using MyISAM or InnoDB. The row counts are cached on the former but not on the latter, if I've understood correctly.
In the end, you should rely on query plans and running tests and measuring their performance rather than these general ramblings.

Randomizing a large dataset

I am trying to find a way to get a random selection from a large dataset.
We expect the set to grow to ~500K records, so it is important to find a way that keeps performing well while the set grows.
I tried a technique from http://forums.mysql.com/read.php?24,163940,262235#msg-262235, but it's not exactly random and it doesn't play well with a LIMIT clause; you don't always get the number of records that you want.
So I thought, since the PK is auto_increment, I could just generate a list of random ids and use an IN clause to select the rows I want. The problem with that approach is that sometimes I need a random set of records having a specific status, a status that is found in at most 5% of the total set. To make that work I would first need to find out which ids have that specific status, so that's not going to work either.
I am using mysql 5.1.46, MyISAM storage engine.
It might be important to know that the query to select the random rows is going to be run very often and the table it is selecting from is appended to frequently.
Any help would be greatly appreciated!
You could solve this with some denormalization:
Build a secondary table that contains the same pkeys and statuses as your data table
Add and populate a status group column which will be a kind of sub-pkey that you auto number yourself (1-based autoincrement relative to a single status)
Pkey  Status  StatusPkey
1     A       1
2     A       2
3     B       1
4     B       2
5     C       1
...   C       ...
n     C       m    (where m = # of C statuses)
When you don't need to filter, you can generate random numbers against the pkey, as you mentioned above. When you do need to filter, generate them against the StatusPkeys of the particular status you're interested in.
There are several ways to build this table. You could have a procedure that you run on an interval, or you could do it live; the latter would be a performance hit, though, since calculating the StatusPkey could get expensive. A sketch of one bulk approach follows.
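One bulk approach on MySQL 5.1 is numbering with user variables; this is only a sketch (table and column names are illustrative, and the usual caveat applies that user-variable evaluation order within a SELECT is not guaranteed):
SET @prev := NULL, @n := 0;
INSERT INTO status_index (Pkey, Status, StatusPkey)
SELECT id, status, seq
FROM (
  SELECT id, status,
         @n := IF(status = @prev, @n + 1, 1) AS seq,
         @prev := status AS prev_status
  FROM data_table
  ORDER BY status, id
) AS numbered;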
Check out this article by Jan Kneschke... It does a great job at explaining the pros and cons of different approaches to this problem...
You can do this efficiently, but you have to do it in two queries.
First get a random offset scaled by the number of rows that match your 5% conditions:
SELECT ROUND(RAND() * (SELECT COUNT(*) FROM MyTable WHERE ...conditions...))
This returns an integer. Next, use the integer as an offset in a LIMIT expression:
SELECT * FROM MyTable WHERE ...conditions... LIMIT 1 OFFSET ?
Not every problem must be solved in a single SQL query.

Should I COUNT(*) or not?

I know it's generally a bad idea to do queries like this:
SELECT * FROM `group_relations`
But when I just want the count, should I go for this query, since it allows the table to change while still yielding the same result?
SELECT COUNT(*) FROM `group_relations`
Or the more specific
SELECT COUNT(`group_id`) FROM `group_relations`
I have a feeling the latter could potentially be faster, but are there any other things to consider?
Update: I am using InnoDB in this case, sorry for not being more specific.
If the column in question is NOT NULL, both of your queries are equivalent. When group_id contains null values,
select count(*)
will count all rows, whereas
select count(group_id)
will only count the rows where group_id is not null.
Also, some database systems, like MySQL, employ an optimization when you ask for count(*), which makes such queries a bit faster than the specific one.
Personally, when just counting, I'm doing count(*) to be on the safe side with the nulls.
If I remember right, in MySQL COUNT(*) counts all rows, whereas COUNT(column_name) counts only the rows that have a non-NULL value in the given column.
COUNT(*) counts all rows, while COUNT(column_name) counts only rows without NULL values in the specified column.
Important to note in MySQL:
COUNT() is very fast on MyISAM tables for * or not-null columns, since the row count is cached. InnoDB has no row count caching, so there is no difference in performance for COUNT(*) or COUNT(column_name), regardless if the column can be null or not. You can read more on the differences on this post at the MySQL performance blog.
If you try SELECT COUNT(1) FROM `group_relations` it will be a bit faster because it will not try to retrieve information from your columns.
Edit: I just did some research and found out that this only happens in some db. In sqlserver it's the same to use 1 or *, but on oracle it's faster to use 1.
http://social.msdn.microsoft.com/forums/en-US/transactsql/thread/9367c580-087a-4fc1-bf88-91a51a4ee018/
Apparently there is no difference between them in MySQL; like SQL Server, the parser appears to rewrite the query to COUNT(1). Sorry if I misled you in some way.
I was curious about this myself. It's all fine to read documentation and theoretical answers, but I like to balance those with empirical evidence.
I have a MySQL table (InnoDB) with 5,607,997 records in it. The table is in my own private sandbox, so I know the contents are static and nobody else is using the server. I think this effectively removes all outside effects on performance. The table has an auto_increment primary key field (Id) that I know will never be null, which I will use for my WHERE clause test (WHERE Id IS NOT NULL).
The only other possible glitch I see in running tests is the cache. The first time a query runs will always be slower than subsequent queries that use the same indexes; I'll refer to that below as the cache-seeding call. Just to mix it up a little, I also ran each query with a WHERE clause I know will always evaluate to true regardless of any data (TRUE = TRUE).
That said here are my results:
Query type | w/o WHERE          | WHERE Id IS NOT NULL | WHERE TRUE = TRUE
COUNT(*)   | 9 min 30.13 sec ++ | 6 min 16.68 sec ++   | 2 min 21.80 sec ++
           | 6 min 13.34 sec    | 1 min 36.02 sec      | 2 min 0.11 sec
           | 6 min 10.06 sec    | 1 min 33.47 sec      | 1 min 50.54 sec
COUNT(Id)  | 5 min 59.87 sec    | 1 min 34.47 sec      | 2 min 3.96 sec
           | 5 min 44.95 sec    | 1 min 13.09 sec      | 2 min 6.48 sec
COUNT(1)   | 6 min 49.64 sec    | 2 min 0.80 sec       | 2 min 11.64 sec
           | 6 min 31.64 sec    | 1 min 41.19 sec      | 1 min 43.51 sec
++ This is considered the cache-seeding call; it is expected to be slower than the rest.
I'd say the results speak for themselves. COUNT(Id) usually edges out the others. Adding a WHERE clause dramatically decreases the access time, even if it's a clause you know will always evaluate to true. The sweet spot appears to be COUNT(Id) ... WHERE Id IS NOT NULL.
I would love to see other peoples' results, perhaps with smaller tables or with where clauses against different fields than the field you're counting. I'm sure there are other variations I haven't taken into account.
Seek Alternatives
As you've seen, when tables grow large, COUNT queries get slow. I think the most important thing is to consider the nature of the problem you're trying to solve. For example, many developers use COUNT queries when generating pagination for large sets of records in order to determine the total number of pages in the result set.
Knowing that COUNT queries will grow slow, you could consider an alternative way to display pagination controls that simply allows you to side-step the slow query. Google's pagination is an excellent example.
Denormalize
If you absolutely must know the number of records matching a specific count, consider the classic technique of data denormalization. Instead of counting the number of rows at lookup time, consider incrementing a counter on record insertion, and decrementing that counter on record deletion.
If you decide to do this, consider using idempotent, transactional operations to keep those denormalized values in synch.
START TRANSACTION;
INSERT INTO `group_relations` (`group_id`) VALUES (1);
UPDATE `group_relations_count` SET `count` = `count` + 1;
COMMIT;
Alternatively, you could use database triggers if your RDBMS supports them.
Depending on your architecture, it might make sense to use a caching layer like memcached to store, increment and decrement the denormalized value, and simply fall through to the slow COUNT query when the cache key is missing. This can reduce overall write-contention if you have very volatile data, though in cases like this, you'll want to consider solutions to the dog-pile effect.
MySQL MyISAM tables have an optimisation for COUNT(*) that skips the full table scan.
The asterisk in COUNT has no bearing on the asterisk for selecting all fields of a table. It's pure rubbish to say that COUNT(*) is slower than COUNT(field).
I intuit that SELECT COUNT(*) is faster than SELECT COUNT(field). If the RDBMS detects that you specified * in COUNT instead of a field, it doesn't need to evaluate anything to increment the count. Whereas if you specify a field in COUNT, the RDBMS will always evaluate whether your field is null or not in order to count it.
But if your field is nullable, specify the field in COUNT.
COUNT(*) facts and myths:
MYTH: "InnoDB doesn't handle count(*) queries well":
Most count(*) queries are executed the same way by all storage engines if you have a WHERE clause; otherwise, InnoDB will have to perform a full table scan.
FACT: InnoDB doesn't optimize count(*) queries without the where clause
It is best to count by an indexed column such as a primary key.
SELECT COUNT(`group_id`) FROM `group_relations`
It should depend on what you are actually trying to achieve as Sebastian has already said, i.e. make your intentions clear! If you are just counting the rows then go for the COUNT(*), or counting a single column go for the COUNT(column).
It might be worth checking out your DB vendor too. Back when I used Informix, it had an optimisation for COUNT(*) which gave a query-plan execution cost of 1, compared to counting single or multiple columns, which would result in a higher figure.
if you try SELECT COUNT(1) FROM group_relations it will be a bit faster because it will not try to retrieve information from your columns.
COUNT(1) used to be faster than COUNT(*), but that's not true anymore, since modern DBMS are smart enough to know that you don't wanna know about columns
The advice I got from MySQL about things like this is that, in general, trying to optimize a query based on tricks like this can be a curse in the long run. There are examples over MySQL's history where somebody's high-performance technique that relies on how the optimizer works ends up being the bottleneck in the next release.
Write the query that answers the question you're asking -- if you want a count of all rows, use COUNT(*). If you want a count of non-null columns, use COUNT(col) WHERE col IS NOT NULL. Index appropriately, and leave the optimization to the optimizer. Trying to make your own query-level optimizations can sometimes make the built-in optimizer less effective.
That said, there are things you can do in a query to make it easier for the optimizer to speed it up, but I don't believe COUNT is one of them.
Edit: The statistics in the answer above are interesting, though. I'm not sure whether there is actually something at work in the optimizer in this case. I'm just talking about query-level optimizations in general.
I know it's generally a bad idea to do queries like this:
SELECT * FROM `group_relations`
But when I just want the count, should I go for this query since that allows the table to change but still yields the same results.
SELECT COUNT(*) FROM `group_relations`
As your question implies, the reason SELECT * is ill-advised is that changes to the table could require changes in your code. That doesn't apply to COUNT(*). It's pretty rare to want the specialized behavior that SELECT COUNT(`group_id`) gives you; typically you want to know the number of records. That's what COUNT(*) is for, so use it.