"SELECT COUNT(*)" is slow, even with where clause - mysql

I'm trying to figure out how to optimize a very slow query in MySQL (I didn't design this):
SELECT COUNT(*) FROM change_event me WHERE change_event_id > '1212281603783391';
+----------+
| COUNT(*) |
+----------+
| 3224022 |
+----------+
1 row in set (1 min 0.16 sec)
Comparing that to a full count:
select count(*) from change_event;
+----------+
| count(*) |
+----------+
| 6069102 |
+----------+
1 row in set (4.21 sec)
The explain statement doesn't help me here:
explain SELECT COUNT(*) FROM change_event me WHERE change_event_id > '1212281603783391'\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: me
type: range
possible_keys: PRIMARY
key: PRIMARY
key_len: 8
ref: NULL
rows: 4120213
Extra: Using where; Using index
1 row in set (0.00 sec)
OK, it still thinks it needs roughly 4 million entries to count, but I could count lines in a file faster than that! I don't understand why MySQL is taking this long.
Here's the table definition:
CREATE TABLE `change_event` (
`change_event_id` bigint(20) NOT NULL default '0',
`timestamp` datetime NOT NULL,
`change_type` enum('create','update','delete','noop') default NULL,
`changed_object_type` enum('Brand','Broadcast','Episode','OnDemand') NOT NULL,
`changed_object_id` varchar(255) default NULL,
`changed_object_modified` datetime NOT NULL default '1000-01-01 00:00:00',
`modified` datetime NOT NULL default '1000-01-01 00:00:00',
`created` datetime NOT NULL default '1000-01-01 00:00:00',
`pid` char(15) default NULL,
`episode_pid` char(15) default NULL,
`import_id` int(11) NOT NULL,
`status` enum('success','failure') NOT NULL,
`xml_diff` text,
`node_digest` char(32) default NULL,
PRIMARY KEY (`change_event_id`),
KEY `idx_change_events_changed_object_id` (`changed_object_id`),
KEY `idx_change_events_episode_pid` (`episode_pid`),
KEY `fk_import_id` (`import_id`),
KEY `idx_change_event_timestamp_ce_id` (`timestamp`,`change_event_id`),
KEY `idx_change_event_status` (`status`),
CONSTRAINT `fk_change_event_import` FOREIGN KEY (`import_id`) REFERENCES `import` (`import_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
Version:
$ mysql --version
mysql Ver 14.12 Distrib 5.0.37, for pc-solaris2.8 (i386) using readline 5.0
Is there something obvious I'm missing? (Yes, I've already tried "SELECT COUNT(change_event_id)", but there's no performance difference).

InnoDB uses clustered primary keys, so the primary key is stored along with the row in the data pages, not in separate index pages. In order to do a range scan you still have to scan through all of the potentially wide rows in data pages; note that this table contains a TEXT column.
Two things I would try:
run optimize table. This will ensure that the data pages are physically stored in sorted order. This could conceivably speed up a range scan on a clustered primary key.
create an additional non-primary index on just the change_event_id column. This will store a copy of that column in index pages, which will be much faster to scan. After creating it, check the explain plan to make sure it's using the new index.
(you also probably want to make the change_event_id column bigint unsigned if it's incrementing from zero)

Here are a few things I suggest:
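Change the column from a "bigint" to an "int unsigned". Do you really ever expect to have more than 4.2 billion records in this table? If not, then you're wasting space (and time) with the extra-wide field. MySQL indexes are more efficient on smaller data types. (This only works if the IDs themselves fit: an "int unsigned" tops out at 4294967295, and the ID in your query is already far larger than that.)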
Run the "OPTIMIZE TABLE" command, and see whether your query is any faster afterward.
You might also consider partitioning your table according to the ID field, especially if older records (with lower ID values) become less relevant over time. A partitioned table can often execute aggregate queries faster than one huge, unpartitioned table.
EDIT:
Looking more closely at this table, it looks like a logging-style table, where rows are inserted but never modified.
If that's true, then you might not need all the transactional safety provided by the InnoDB storage engine, and you might be able to get away with switching to MyISAM, which is considerably more efficient on aggregate queries.

I've run into behavior like this before with IP geolocation databases. Past some number of records, MySQL's ability to get any advantage from indexes for range-based queries apparently evaporates. With the geolocation DBs, we handled it by segmenting the data into chunks that were reasonable enough to allow the indexes to be used.

Check to see how fragmented your indexes are. At my company we have a nightly import process that trashes our indexes, and over time it can have a profound impact on data access speeds. For example, we had a SQL procedure that took 2 hours to run one day; after defragmenting the indexes it took 3 minutes. We use SQL Server 2005; I'll look for a script that can check this on MySQL.
Update: Check out this link: http://dev.mysql.com/doc/refman/5.0/en/innodb-file-defragmenting.html

Run "ANALYZE TABLE table_name" on that table - it's possible that the index statistics are out of date.
You can often tell this by running "SHOW INDEX FROM table_name". If the cardinality value is NULL then you need to force re-analysis.

MySQL does say "Using where" first, since it needs to read every value in that range from the index to actually count it. With InnoDB it also tries to "grab" that 4 million record range to count it.
You may need to experiment with different transaction isolation levels: http://dev.mysql.com/doc/refman/5.1/en/set-transaction.html#isolevel_read-uncommitted
and see which one is better.
With MyISAM it would be fast, but under a write-intensive workload its table-level locking will cause contention.

To make the search more efficient, I recommend adding an index. Here is the command, so you can try it and measure again:
CREATE INDEX ixid_1 ON change_event (change_event_id);
and then repeat the query:
SELECT COUNT(*) FROM change_event me WHERE change_event_id > '1212281603783391';
-JACR

I would create a "counters" table and add "create row"/"delete row" triggers to the table you are counting. The triggers should increase/decrease count values on "counters" table on every insert/delete, so you won't need to compute them every time you need them.
You can also accomplish this on the application side by caching the counters but this will involve clearing the "counter cache" on every insertion/deletion.
For some reference take a look at this http://pure.rednoize.com/2007/04/03/mysql-performance-use-counter-tables/
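A minimal sketch of the counter-table idea, using Python's built-in sqlite3 as a stand-in for MySQL (an assumption: MySQL's trigger syntax differs slightly, and the names counters, change_event_ai and change_event_ad are made up for illustration):

```python
import sqlite3

# In-memory SQLite stands in for MySQL; table and trigger names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE change_event (
    change_event_id INTEGER PRIMARY KEY,
    status TEXT NOT NULL
);
CREATE TABLE counters (
    name  TEXT PRIMARY KEY,
    value INTEGER NOT NULL
);
INSERT INTO counters (name, value) VALUES ('change_event', 0);

-- Keep the counter in step with every insert and delete.
CREATE TRIGGER change_event_ai AFTER INSERT ON change_event
BEGIN
    UPDATE counters SET value = value + 1 WHERE name = 'change_event';
END;

CREATE TRIGGER change_event_ad AFTER DELETE ON change_event
BEGIN
    UPDATE counters SET value = value - 1 WHERE name = 'change_event';
END;
""")

conn.executemany(
    "INSERT INTO change_event (change_event_id, status) VALUES (?, ?)",
    [(i, "success") for i in range(1, 6)],
)
conn.execute("DELETE FROM change_event WHERE change_event_id = 3")


def cached_count(connection):
    """Read the maintained counter instead of scanning the table."""
    row = connection.execute(
        "SELECT value FROM counters WHERE name = 'change_event'"
    ).fetchone()
    return row[0]


def real_count(connection):
    """Full scan, used here only to check the counter."""
    return connection.execute("SELECT COUNT(*) FROM change_event").fetchone()[0]


print(cached_count(conn), real_count(conn))
```

Note that a single total counter like this does not answer a range count such as change_event_id > X on its own; you would need one counter per bucket you care about.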

Related

Poor query with table scans occasionally takes hours on MariaDB

My application uses a MariaDB database which I try to keep isolated, but one particular user goes straight to the database and started complaining today after 6 weeks without incident that one of their queries slowed down from 5 mins (which I thought was bad enough) to over 120 mins.
Since then today it has sometimes been as fast as usual, sometimes slowing down again.
This is their query:
SELECT MAX(last_updated) FROM data_points;
This is the table:
CREATE TABLE data_points (
seriesId INT UNSIGNED NOT NULL,
modifiedDate DATE NOT NULL,
valueDate DATE NOT NULL,
value DOUBLE NOT NULL,
created DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
last_updated DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP()
ON UPDATE CURRENT_TIMESTAMP,
id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
CONSTRAINT pk_data PRIMARY KEY (seriesId, modifiedDate, valueDate),
KEY ix_data_modifieddate (modifiedDate),
KEY ix_data_id (id),
CONSTRAINT fk_data_seriesid FOREIGN KEY (seriesId)
REFERENCES series(id)
) ENGINE=InnoDB
DEFAULT CHARSET=utf8mb4
COLLATE=utf8mb4_unicode_ci
MAX_ROWS=222111000;
and this is the EXPLAIN:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | data_points | ALL | NULL | NULL | NULL | NULL | 224166191 |
The table has approx 250M rows and is growing relatively fast.
I can coerce the user into doing something more sensible but in the short term I'm keen to understand why the query duration is going crazy today after 6 weeks of calm. I'll accept the first answer that can explain that.
SELECT MAX(last_updated) FROM data_points; is easily optimized:
INDEX(last_updated)
That index will make that MAX be essentially instantaneous. And it will avoid pounding on the disk and cache (see below).
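If you want to see the planner change its mind before touching the real table, here is a small sketch using Python's built-in sqlite3 as a stand-in for MariaDB. Assumptions: SQLite's EXPLAIN QUERY PLAN output is not the same as MariaDB's EXPLAIN, the table is cut down to two columns, and the index name ix_data_last_updated is hypothetical.

```python
import sqlite3

# Cut-down stand-in for data_points; only the columns the query touches.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data_points (id INTEGER PRIMARY KEY, last_updated TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO data_points (last_updated) VALUES (?)",
    [("2020-01-%02d 00:00:00" % day,) for day in range(1, 29)],
)


def plan(connection, sql):
    """Return the planner's description of how it would run the query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[3] for row in rows)


query = "SELECT MAX(last_updated) FROM data_points"

plan_before = plan(conn, query)  # no index yet, so the whole table is read
conn.execute("CREATE INDEX ix_data_last_updated ON data_points (last_updated)")
plan_after = plan(conn, query)   # the planner can now use the index

latest = conn.execute(query).fetchone()[0]
print(plan_before)
print(plan_after)
print(latest)
```

On the real server the equivalent check is to run EXPLAIN on the query after adding the index and confirm the type column is no longer ALL.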
Two things control the un-indexed speed:
The size of the table, which is "growing relatively fast", and
[This is probably what you are fishing for.] How much of the table is cached when the query is run. This can make a 10x difference in the speed. You can partially test this claim thus:
Restart mysqld; time the query; time it again. The first run had to hit the disk a lot (because of the fresh restart); the second may have found everything in RAM.
Another thing that can mess with the timings: If some other 'big' query is run and it bumps blocks of this table out of cache, then the query will again be slow.
Of relevance: Size of table, value of innodb_buffer_pool_size, and amount of RAM.
On an unrelated topic... That PRIMARY KEY (seriesId, modifiedDate, valueDate) seems strange. A PK must be unique. Dates (datetime, etc) are likely to have multiple entries for the same day/second; so can you be sure of uniqueness? Especially with 2 dates?
(More)
Please explain the meaning of each of the 4 dates. And ask yourself if they are all needed. (About half the bulk of the table is those dates!)
The table has an AUTO_INCREMENT; is it needed by some other table? If not then either it could be removed, or it could be used to assure that the PK is unique.
To better help you, we need to see more of the queries.

Order By causing my query to run really slow

I have an sql query as follows
select *
from incidents
where remote_ip = '192.168.1.1' and is_infringement = 1
order by reported_at desc
limit 1;
This query at the moment takes 313.24 secs to run.
If I remove the order by so the query is
select *
from incidents
where remote_ip = '192.168.1.1' and is_infringement = 1
then it only takes 0.117 secs to run.
The reported_at column is indexed.
So 2 questions: firstly, why is it taking so long with this ORDER BY clause, and secondly, how can I speed it up?
EDIT: In response to the questions below here is the output when using explain:
'1', 'SIMPLE', 'incidents', 'index', 'uniqueReportIndex,idx_incidents_remote_ip', 'incidentsReportedAt', '4', NULL, '1044', '100.00', 'Using where'
And the table create statement:
CREATE TABLE `incidents` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`incident_ip_id` int(10) unsigned DEFAULT NULL,
`remote_id` bigint(20) DEFAULT NULL,
`remote_ip` char(32) NOT NULL,
`is_infringement` tinyint(1) NOT NULL DEFAULT '0',
`messageBody` text,
`reported_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' COMMENT 'Formerly : created_datetime',
`created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`updated_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
PRIMARY KEY (`id`),
UNIQUE KEY `uniqueReportIndex` (`remote_ip`,`host_id_1`,`licence_feature`,`app_end`),
UNIQUE KEY `uniqueRemoteIncidentId` (`remote_id`),
KEY `incident_ip_id` (`incident_ip_id`),
KEY `id` (`id`),
KEY `incidentsReportedAt` (`reported_at`),
KEY `idx_incidents_remote_ip` (`remote_ip`)
)
Note: I have omitted some of the non-relevant fields, so there are more indexes than fields, but you can safely assume the fields for all the indexes are in the table.
The output of EXPLAIN reveals that, because of the ORDER BY clause, MySQL decides to use the incidentsReportedAt index. It reads each row from the table data in the order provided by the index and checks the WHERE conditions on it. This requires reading a lot of information from the table data, information that is scattered through the entire table. Not a good workflow.
Update
The OP created an index on columns reported_at and remote_ip (as suggested in the original answer, see below) and the execution time went down from 313 to 133 seconds. An improvement, but not enough. I think the cause of this still large execution time is the access to table data for each row to verify the is_infringement = 1 part of the WHERE clause, but even adding it to the index won't help very much.
The OP says in a comment:
Ok after further research and changing the index to be the other way round (remote_ip, reported_at) the query is now super fast (0.083 sec).
This index is better, indeed, because the remote_ip = '192.168.1.1' condition filters out a lot of rows. The same effect can be achieved using the existing uniqueReportIndex index. It is possible that the original index on reported_at fooled MySQL into thinking it is better to use it to check the rows in the order required by ORDER BY instead of filtering first and sorting at the end.
I think MySQL uses the new index on (remote_ip, reported_at) for filtering (WHERE remote_ip = '192.168.1.1') and for sorting (ORDER BY reported_at DESC). The WHERE condition provides a small list of candidate rows that are easily identified and also sorted using this index.
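The filter-then-sort behaviour of an index on (remote_ip, reported_at) can be sketched with Python's built-in sqlite3. This is a stand-in, not MySQL: the planner and its output differ, the table is reduced to the columns the query needs, the data is synthetic, and the index name idx_incidents_ip_reported is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE incidents (
    id INTEGER PRIMARY KEY,
    remote_ip TEXT NOT NULL,
    is_infringement INTEGER NOT NULL DEFAULT 0,
    reported_at TEXT NOT NULL
)
""")

# Synthetic data: 1000 rows spread over 10 addresses, one row per minute.
rows = []
for i in range(1000):
    ip = "192.168.1.%d" % (i % 10)
    rows.append((ip, i % 2, "2015-01-01 %02d:%02d:00" % (i // 60, i % 60)))
conn.executemany(
    "INSERT INTO incidents (remote_ip, is_infringement, reported_at) VALUES (?, ?, ?)",
    rows,
)

# Equality column first, sort column second.
conn.execute(
    "CREATE INDEX idx_incidents_ip_reported ON incidents (remote_ip, reported_at)"
)

query = """
SELECT id, reported_at
FROM incidents
WHERE remote_ip = '192.168.1.1' AND is_infringement = 1
ORDER BY reported_at DESC
LIMIT 1
"""

plan_rows = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
plan_after = " ".join(row[3] for row in plan_rows)
latest = conn.execute(query).fetchone()

print(plan_after)
print(latest)
```

Because the equality column comes first, the matching index entries are already in reported_at order, so the engine can walk them backwards and stop at the first row that passes the remaining is_infringement check.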
The original answer follows.
The advice it provides is not correct but it helped the OP find the correct solution.
Create an index on columns reported_at and remote_ip in this order
then see what EXPLAIN says and how the query performs. It should work faster.
You can even create the new index on columns reported_at, remote_ip and is_infringement (the order of columns in the index is very important).
The index on three columns helps MySQL identify the rows without the need to read the table data (because all the columns from WHERE and ORDER BY clauses are in the index). It needs to read the table data only for the rows it returns because of SELECT *.
After you create the new index (either on two or three columns), remove the old index incidentsReportedAt. It is not needed any more; it uses disk and memory space and takes time to be updated but it is not used. The new index (that has the reported_at column on the first position) will be used instead.
The index on two columns requires more reads of the table data for the is_infringement = 1 condition. The query probably runs a little slower than with the three-column index. On the other hand, there is a little gain on table updates and disk and memory space usage.
The decision to index on two or three columns depends on how often the query posted in the question runs and what it serves (visitors, admins, cron jobs etc).

How can I select a set of IDs from large table fast?

I have a large table with ID as the primary key. It has about 3 million rows, and I need to extract a small set of rows based on a given ID list.
Currently I am doing it with WHERE ... IN, but it's very slow, around 5 to 10 s.
My code:
select id,fa,fb,fc
from db1.t1
where id in(15,213,156,321566,13,165,416,132163,6514361,... );
I tried to query one ID at a time, but it is still slow, like:
select id,fa,fb,fc from db1.t1 where id =25;
I also tried to use a temp table, inserting the ID list and joining against it, but there was no improvement.
select t1.id,fa,fb,fc from db1.t1 inner join db1.temp on t1.id=temp.id;
Is there any way to make it faster?
here is table.
CREATE TABLE `db1`.`t1` (
`id` int(9) NOT NULL,
`url` varchar(256) COLLATE utf8_unicode_ci NOT NULL,
`title` varchar(1024) COLLATE utf8_unicode_ci DEFAULT NULL,
`lastUpdate` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`lastModified` datetime DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Ok here is explain select.
id=1,
select_type='SIMPLE',
table='t1',
type='range',
possible_keys='PRIMARY',
key='PRIMARY',
key_len= '4',
ref= '',
rows=9,
extra='Using where'
Here are some tips on how to speed up queries against your table:
Try to avoid complex SELECT queries on MyISAM tables that are updated frequently, to avoid problems with table locking that occur due to contention between readers and writers.
To sort an index and data according to an index, use myisamchk --sort-index --sort-records=1 (assuming that you want to sort on index 1). This is a good way to make queries faster if you have a unique index from which you want to read all rows in order according to the index. The first time you sort a large table this way, it may take a long time.
For MyISAM tables that change frequently, try to avoid all variable-length columns (VARCHAR, BLOB, and TEXT). The table uses dynamic row format if it includes even a single variable-length column.
Strings are automatically prefix- and end-space compressed in MyISAM indexes. See “CREATE INDEX Syntax”.
You can increase performance by caching queries or answers in your application and then executing many inserts or updates together. Locking the table during this operation ensures that the index cache is only flushed once after all updates. You can also take advantage of MySQL's query cache to achieve similar results; see “The MySQL Query Cache”.
You can read further in these articles on optimizing your queries:
MySQL Query Cache
Query Cache SELECT Options
Optimizing MySQL queries with IN operator
Optimizing MyISAM Queries
First of all, clustered indexes are faster than non-clustered indexes, if I am not wrong.
Also, even if you have an index on a table, it can help to rebuild the index or refresh its statistics.
I saw in an SQL explain plan that WHERE ID IN (...) is converted to
WHERE (ID = 1) OR (ID = 2) OR (ID = 3) ..., so the bigger the list, the more ORs; for very big tables, avoid IN ().
Run EXPLAIN on this SQL and it can tell you where the actual bottleneck is.
Check this link: http://dev.mysql.com/doc/refman/5.5/en/explain.html
Hope this works.
Looks like the original SQL statement using 'in' should be fine, since the id column is indexed.
I think you basically need a faster computer - are you doing this query on shared hosting?

Count the number of rows between unix time stamps for each ID

I'm trying to populate some data for a table. The query is being run on a table that contains ~50 million records. The query I'm currently using is below. It counts the number of rows that match the template id and are BETWEEN two unix timestamps:
SELECT COUNT(*) as count FROM `s_log`
WHERE `time_sent` BETWEEN '1346904000' AND '1346993271'
AND `template` = '1'
While the query above does work, performance is rather slow while looping through each template, which at times can be in the hundreds. The timestamps are stored as int and are properly indexed. Just to test things out, I tried running the query below, omitting the time_sent restriction:
SELECT COUNT(*) as count FROM `s_log`
WHERE `template` = '1'
As expected, it runs very fast, but is obviously not restricting count results inside the correct time frame. How can I obtain a count for a specific template AND restrict that count BETWEEN two unix timestamps?
EXPLAIN:
1 | SIMPLE | s_log | ref | time_sent,template | template | 4 | const | 71925 | Using where
SHOW CREATE TABLE s_log:
CREATE TABLE `s_log` (
`id` int(255) NOT NULL AUTO_INCREMENT,
`email` varchar(255) NOT NULL,
`time_sent` int(25) NOT NULL,
`template` int(55) NOT NULL,
`key` varchar(255) NOT NULL,
`node_id` int(55) NOT NULL,
`status` varchar(55) NOT NULL,
PRIMARY KEY (`id`),
KEY `email` (`email`),
KEY `time_sent` (`time_sent`),
KEY `template` (`template`),
KEY `node_id` (`node_id`),
KEY `key` (`key`),
KEY `status` (`status`),
KEY `timestamp` (`timestamp`)
) ENGINE=MyISAM AUTO_INCREMENT=2078966 DEFAULT CHARSET=latin1
The best index you can have in this case is a composite one on template + time_sent:
CREATE INDEX template_time_sent ON s_log (template, time_sent)
PS: Also, as long as all the columns in the query are integers, DON'T enclose their values in quotes (in some cases it could lead to issues, at least with older MySQL versions)
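Here is a rough sketch of that composite index at work, using Python's built-in sqlite3 as a stand-in for MySQL. Assumptions: SQLite's planner output differs from MySQL's EXPLAIN, the table is reduced to the three relevant columns, and the rows are synthetic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE s_log (
    id INTEGER PRIMARY KEY,
    time_sent INTEGER NOT NULL,
    template INTEGER NOT NULL
)
""")

# Synthetic rows: one message every 100 seconds, cycling through 5 templates.
conn.executemany(
    "INSERT INTO s_log (time_sent, template) VALUES (?, ?)",
    [(1346900000 + i * 100, i % 5) for i in range(2000)],
)

# Equality column (template) first, range column (time_sent) second.
conn.execute("CREATE INDEX template_time_sent ON s_log (template, time_sent)")

query = (
    "SELECT COUNT(*) FROM s_log "
    "WHERE template = ? AND time_sent BETWEEN ? AND ?"
)
params = (1, 1346904000, 1346993271)  # integers, not quoted strings

plan_rows = conn.execute("EXPLAIN QUERY PLAN " + query, params).fetchall()
plan_text = " ".join(row[3] for row in plan_rows)
count = conn.execute(query, params).fetchone()[0]

print(plan_text)
print(count)
```

Both columns the query needs live in the index, so the count can be answered from the index alone without touching the table rows.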
First, you have to create an index that covers both of your columns together (not separately). Also check your table type; I think it would work great if your table is InnoDB.
And lastly, write your WHERE clause in this fashion:
WHERE `template` = '1' AND `time_sent` BETWEEN '1346904000' AND '1346993271'
The idea is to check template first and only then test the time range. Note that MySQL's optimizer picks the evaluation order from the available indexes rather than from the order in which the conditions are written, so the composite index is what gives you the performance edge.
If you have to call the query for each template maybe it would be faster to get all the information with one query call by using GROUP BY:
SELECT template, COUNT(*) as count FROM `s_log`
WHERE `time_sent` BETWEEN 1346904000 AND 1346993271
GROUP BY template;
It's just a guess that this would be faster and you also would have to redesign your code a bit.
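On the application side, the redesign could look roughly like this sketch, which uses Python's built-in sqlite3 as a stand-in for MySQL. The helper name counts_by_template and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE s_log (
    id INTEGER PRIMARY KEY,
    time_sent INTEGER NOT NULL,
    template INTEGER NOT NULL
)
""")
conn.executemany(
    "INSERT INTO s_log (time_sent, template) VALUES (?, ?)",
    [
        (1346903999, 1),  # just before the window
        (1346904000, 1),
        (1346904500, 1),
        (1346950000, 2),
        (1346993271, 2),
        (1346993272, 2),  # just after the window
        (1346960000, 3),
    ],
)


def counts_by_template(connection, start, end):
    """One round trip: return {template: row count} for the time window."""
    cursor = connection.execute(
        "SELECT template, COUNT(*) FROM s_log "
        "WHERE time_sent BETWEEN ? AND ? "
        "GROUP BY template",
        (start, end),
    )
    return dict(cursor.fetchall())


counts = counts_by_template(conn, 1346904000, 1346993271)
print(counts)
```

Templates with no rows in the window simply do not appear in the result, so the calling code should treat a missing key as zero, for example with counts.get(template_id, 0).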
You could also try to use InnoDB instead of MyISAM. InnoDB uses a clustered index, which may perform better on large tables. From the MySQL site:
Accessing a row through the clustered index is fast because the row data is on the same page where the index search leads. If a table is large, the clustered index architecture often saves a disk I/O operation when compared to storage organizations that store row data using a different page from the index record. (For example, MyISAM uses one file for data rows and another for index records.)
There are some questions on Stack Overflow which discuss the performance differences between InnoDB and MyISAM:
Should I use MyISAM or InnoDB Tables for my MySQL Database?
Migrating from MyISAM to InnoDB
MyISAM versus InnoDB

Effective indexing for a DB with millions of rows

I have a MyISAM MySQL DB table with many millions of rows, which I've been asked to work with, but I need to make the queries faster first.
There was no indexing before at all! I added a new index on the 'type' column which has helped but I wanted to know if there were any other columns that might be best indexed too?
Here is my CREATE TABLE:
CREATE TABLE `clicks` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`companyid` int(11) DEFAULT '0',
`type` varchar(32) NOT NULL DEFAULT '',
`contextid` int(11) NOT NULL DEFAULT '0',
`period` varchar(16) NOT NULL DEFAULT '',
`timestamp` int(11) NOT NULL DEFAULT '0',
`location` varchar(32) NOT NULL DEFAULT '',
`ip` varchar(32) DEFAULT NULL,
`useragent` varchar(64) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `companyid` (`companyid`,`type`,`period`),
KEY `type` (`type`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
A typical SELECT statement would commonly filter by the companyid, type and contextid columns.
For example:
SELECT period, count(period) as count FROM clicks WHERE contextid in (123) AND timestamp > 123123123 GROUP BY period ORDER BY timestamp ASC
or
SELECT period, count(period) as count FROM clicks WHERE contextid in (123) AND type IN('direct') AND timestamp > 123123123 GROUP BY period ORDER BY timestamp ASC
The last part of my questions would be this: when I added the index on type it took about 1 hour - if I am adding or removing multiple indexes, can you do it in one query or do you have to do them 1 by 1 and wait for each to finish?
Thanks for your thoughts.
Indexing is really powerful, but isn't as much of a black art as you might think. Learn about MySQL's EXPLAIN statement; it will help you systematically find where improvements can be made:
http://dev.mysql.com/doc/refman/5.5/en/execution-plan-information.html
Which indexes to add really depends on your queries. Anything that you're sorting (GROUP BY) or selecting (WHERE) on is a good candidate for an index.
You may also want to have a look at how MySQL uses indexes.
As regards the time taken to add indexes, where you're sure you want to add multiple indexes, you could do mysqldump, manually edit the table structure in the .sql file, and then reimport. This can take a while, but at least you can do all the changes at once. However, this doesn't really fit with the idea of testing as you go... so use this approach with care. (I've done it when modifying a number of tables with the same structure, and wanting to add some indexes to all of them.)
Also, I'm not 100% sure, but I think that when you add an index, MySQL creates a copy of the table with the index, and then deletes the original table - so make sure there's enough space on your server / partition for the current size of the table & some margin.
Here's one of your queries, broken on multiple lines so it's easier to read.
SELECT period, count(period) as count
FROM clicks
WHERE contextid in (123)
AND timestamp > 123123123
GROUP BY period
ORDER BY timestamp ASC
I'm not even sure this is a valid query. I thought your GROUP BY and ORDER BY had to match up in SQL. I think that you would have to order on count, as the GROUP BY would order on period.
The important part of the query for optimization is the WHERE clause. In this case, an index on contextid and timestamp would speed up the query.
Obviously, you can't index every WHERE clause. You index the most common WHERE clauses.
I'd add indexes to existing tables one at a time. Yes, it's slow. But you should only have to add the indexes once.
In my opinion timestamp and period can be indexed, as they are used in the WHERE and GROUP BY clauses respectively.
Also instead of using contextid in (123) use contextid = 123 and instead of type IN('direct') use type = 'direct'
You can add multiple indexes in a single query. This will save some time overall, but the table will be inaccessible while you wait for the entire query to complete:
ALTER TABLE table1 ADD INDEX `Index1` (`col1`),
ADD INDEX `Index2` (`col2`);
Regarding indexes, it's a complex subject. However, adding indexes on single columns with high-cardinality that are included in your WHERE clause is a good place to start. MySQL will try to pick the best index for the query and use that.
To further tweak performance, you should consider multi-column indexes, which I see you've implemented with your 'companyid' index.
Whether an index can be used all the way through to a GROUP BY or ORDER BY clause depends on a lot of conditions, which you might want to read up on.
To best utilize indexes, your database server must have enough RAM to store the indexes entirely in memory and the server must be configured properly to actually utilize the memory.