I am running SQL queries on a MySQL table that holds 110 million+ unique records for a whole day.
Problem: whenever I run any query with a WHERE clause, it takes at least 30-40 minutes. Since I want to generate most of the data on the next day, I need access to the whole table.
Could you please guide me to optimize / restructure the deployment model?
Site description:
mysql Ver 14.12 Distrib 5.0.24, for pc-linux-gnu (i686) using readline 5.0
4 GB RAM,
Dual Core dual CPU 3GHz
RHEL 3
my.cnf contents :
[mysqld]
datadir=/data/mysql/data/
socket=/tmp/mysql.sock
sort_buffer_size = 2000000
table_cache = 1024
key_buffer = 128M
myisam_sort_buffer_size = 64M
# Default to using old password format for compatibility with mysql 3.x
# clients (those using the mysqlclient10 compatibility package).
old_passwords=1
[mysql.server]
user=mysql
basedir=/data/mysql/data/
[mysqld_safe]
err-log=/data/mysql/data/mysqld.log
pid-file=/data/mysql/data/mysqld.pid
DB table details:
CREATE TABLE `RAW_LOG_20100504` (
`DT` date default NULL,
`GATEWAY` varchar(15) default NULL,
`USER` bigint(12) default NULL,
`CACHE` varchar(12) default NULL,
`TIMESTAMP` varchar(30) default NULL,
`URL` varchar(60) default NULL,
`VERSION` varchar(6) default NULL,
`PROTOCOL` varchar(6) default NULL,
`WEB_STATUS` int(5) default NULL,
`BYTES_RETURNED` int(10) default NULL,
`RTT` int(5) default NULL,
`UA` varchar(100) default NULL,
`REQ_SIZE` int(6) default NULL,
`CONTENT_TYPE` varchar(50) default NULL,
`CUST_TYPE` int(1) default NULL,
`DEL_STATUS_DEVICE` int(1) default NULL,
`IP` varchar(16) default NULL,
`CP_FLAG` int(1) default NULL,
`USER_LOCATE` bigint(15) default NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1 MAX_ROWS=200000000;
Thanks in advance!
I would encourage you to learn how to use EXPLAIN to analyze the database's plan for query optimization. Also see Baron Schwartz' presentation EXPLAIN Demystified (link to PDF of his slides is on that page).
Learn how to create indexes -- this is not the same thing as a primary key or an auto-increment pseudokey. See the presentation More Mastering the Art of Indexing by Yoshinori Matsunobu.
Your table could use an index on CP_FLAG and WEB_STATUS.
CREATE INDEX CW ON RAW_LOG_20100504 (CP_FLAG, WEB_STATUS);
This helps to look up the subset of rows based on your cp_flag condition.
Then you still run into MySQL's unfortunate inefficiency with GROUP BY queries. It copies an interim result set into a temporary file on disk and sorts it there. Disk I/O tends to kill performance.
You can raise your sort_buffer_size configuration parameter until it's large enough that MySQL can sort the result set in memory instead of on disk. But that might not be practical for a very large result set.
You might have to resort to precalculating the COUNT() you need, and update this statistic periodically.
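For example, a minimal sketch of that approach (the summary table and its names are illustrative, not from your schema), refreshed periodically from cron or a scheduled job:
CREATE TABLE web_status_counts (
  WEB_STATUS INT NOT NULL PRIMARY KEY,
  CNT BIGINT UNSIGNED NOT NULL
);
-- Recompute the statistic; REPLACE overwrites any stale rows.
REPLACE INTO web_status_counts
SELECT WEB_STATUS, COUNT(*)
FROM RAW_LOG_20100504
WHERE CP_FLAG > 0 AND WEB_STATUS IS NOT NULL
GROUP BY WEB_STATUS;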
The comment from @Marcus gave me another idea. You're grouping by web status, and the set of distinct web status values is a fairly short list that doesn't change. So you could run a separate query for each distinct value and generate the results you need much faster than with a GROUP BY query that creates a temp table to do the sorting. Or you could run a subquery for each status value, and UNION them together:
(SELECT COUNT(*), WEB_STATUS FROM RAW_LOG_20100504 WHERE CP_FLAG > 0 AND WEB_STATUS = 200)
UNION
(SELECT COUNT(*), WEB_STATUS FROM RAW_LOG_20100504 WHERE CP_FLAG > 0 AND WEB_STATUS = 404)
UNION
(SELECT COUNT(*), WEB_STATUS FROM RAW_LOG_20100504 WHERE CP_FLAG > 0 AND WEB_STATUS = 304)
UNION
...etc...
ORDER BY 1 DESC;
Because your covering index includes CP_FLAG and WEB_STATUS, these queries never need to read the actual rows in the table. They only read entries in the index, which they can access much faster because (a) they're in a sorted tree, and (b) they may be cached in memory if you allocate enough to your key_buffer_size.
The EXPLAIN report I tried (with 1M rows of test data) shows that this uses indexes well, and does not create a temp table:
+------+--------------+------------------+------+--------------------------+
| id | select_type | table | key | Extra |
+------+--------------+------------------+------+--------------------------+
| 1 | PRIMARY | RAW_LOG_20100504 | CW | Using where; Using index |
| 2 | UNION | RAW_LOG_20100504 | CW | Using where; Using index |
| 3 | UNION | RAW_LOG_20100504 | CW | Using where; Using index |
| NULL | UNION RESULT | <union1,2,3> | NULL | Using filesort |
+------+--------------+------------------+------+--------------------------+
The Using filesort for the last line just means it has to sort without the benefit of an index. But sorting the three rows produced by the subqueries is trivial and MySQL does it in memory.
When designing optimal database solutions, there are rarely simple answers. A lot depends on how you use the data and what kind of queries are of higher priority to make fast. If there were a single, simple answer that worked in all circumstances, the software would just enable that design by default and you wouldn't have to do anything.
You really need to read a lot of manuals, books and blogs to understand how to take most advantage of all the features available to you.
Yes, I would still recommend using indexes. Clearly it was not working before, when you were querying 100 million rows without the benefit of an index.
You have to understand that you must design indexes that benefit the specific query you want to run. I have no way of knowing if the index you just described in your comment is appropriate, because you haven't shown the other query you're trying to speed up.
Indexing is a complex topic. If you define the index on the wrong columns, or if you get the columns in the wrong order, it may not be usable by a given query. I've been supporting SQL developers since 1994, and I've never found a single, concise rule to explain how to design indexes.
You seem like you need a mentor, because you're at a stage where you need a lot of questions answered. Is there someone where you work that you could ask to help you?
Add an index to any field that appears in your WHERE clause. Primary keys must be unique; unique indexes must be unique; but uniqueness is not a prerequisite for an ordinary index.
Badly defined or non-existent indexes are one of the primary reasons for poor performance, and fixing them can often lead to phenomenal improvements.
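For example, against the log table from the question (the index name here is arbitrary), an ordinary non-unique index looks like this:
-- Duplicates are allowed; unlike a PRIMARY KEY or UNIQUE index,
-- uniqueness is not required.
CREATE INDEX idx_web_status ON RAW_LOG_20100504 (WEB_STATUS);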
Quick info:
http://www.databasejournal.com/features/mysql/article.php/1382791/Optimizing-MySQL-Queries-and-Indexes.htm
http://www.tizag.com/mysqlTutorial/mysql-index.php
Related
I'm trying to figure out why a simple select with a LIMIT 1 clause (admittedly, on a really bloated table with a lot of rows and indices) is sometimes taking 30+ seconds (even 2 minutes, sometimes) to execute on an AWS RDS Aurora instance. This is on a writer instance.
It seems to occur for the first query from a client, only on a particular select that looks through hundreds of thousands of rows, and only sometimes.
The query is in the form:
SELECT some_table.col1, some_table.col2, some_table.col3, some_table.col4,
MAX(some_table.col2) AS SomeValue
FROM some_table
WHERE some_table.col3=123456 LIMIT 1;
And 'explain' outputs:
+----+-------------+---------------+------+---------------+---------+---------+-------+--------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+------+---------------+---------+---------+-------+--------+-------+
| 1 | SIMPLE | some_table | ref | col1 | col1 | 4 | const | 268202 | NULL |
+----+-------------+---------------+------+---------------+---------+---------+-------+--------+-------+
I managed to reproduce the issue and captured the profile for the query in phpMyAdmin. phpMyAdmin recorded the query as taking 30.1 seconds to execute, but the profiler shows that execution itself takes less than a second:
So it looks like the execution itself isn't taking a lot of time; what could be causing this latency issue? I also found the same query recorded in RDS Performance Insights:
This seems to occur for the first query in a series of identical or similar queries. Could it be a caching issue? I've tried running RESET QUERY CACHE; in an attempt to reproduce the latency but with no success. Happy to provide more information about the infrastructure if that would help.
More info
SHOW VARIABLES LIKE 'query_cache%';
SHOW GLOBAL STATUS LIKE 'Qc%';
Rows examined and sent (from Performance Insights):
SHOW CREATE TABLE output:
CREATE TABLE `some_table` (
`col1` int(10) unsigned NOT NULL AUTO_INCREMENT,
`col2` int(10) unsigned NOT NULL DEFAULT '0',
`col3` int(10) unsigned NOT NULL DEFAULT '0',
`col4` int(10) unsigned NOT NULL DEFAULT '0',
`col5` mediumtext COLLATE utf8mb4_unicode_ci NOT NULL,
`col6` varchar(100) COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '',
`col7` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`col1`),
KEY `col2` (`col2`),
KEY `col3` (`col3`),
KEY `col4` (`col4`),
KEY `col6` (`col6`),
KEY `col7` (`col7`)
) ENGINE=InnoDB AUTO_INCREMENT=123456789 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
Possible explanations are:
The query is delayed from executing because it's waiting for a lock. Even a read-only query like SELECT may need to wait for a metadata lock (see the diagnostic sketch after this list).
The query must examine hundreds of thousands of rows, and it takes time to read those rows from storage. Aurora is supposed to have fast storage, but it can't be zero cost.
The system load on the Aurora instance is too high, because it's competing with other queries you are running.
The system load on the Aurora instance is too high, because the host is shared by other Aurora instances owned by other Amazon customers. This case is sometimes called "noisy neighbor" and there's practically nothing you can do to prevent it. Amazon automatically colocates virtual machines for different customers on the same hardware.
It's taking a long time to transfer the result set to the client. Since you use LIMIT 1, that single row would have to be huge to take 30 seconds, or else your client must be on a very slow network.
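If you suspect the first of these explanations (a lock), here is a diagnostic sketch; the second statement assumes performance_schema metadata-lock instrumentation is enabled on your instance:
-- While a slow occurrence is in flight, blocked sessions show
-- "Waiting for table metadata lock" in the State column:
SHOW FULL PROCESSLIST;
-- More precise, if the instrumentation is enabled (assumption):
SELECT * FROM performance_schema.metadata_locks
WHERE OBJECT_TYPE = 'TABLE' AND OBJECT_NAME = 'some_table';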
The query cache is not relevant the first time you run the query. Subsequently executing the same query will be faster, until some later time after the result has been evicted from the cache, or if any data in that table is updated, which forces the result of all queries against that table to be evicted from the query cache.
It seems that your understanding of the LIMIT clause isn't quite right in this scenario.
If you were to run a simple query like SELECT * FROM tablea LIMIT 1; then the database would present you with the first row it comes across and stop there, giving you a quick return.
However, in your example above, you have both an aggregate function and a WHERE clause.
Therefore, in order for your database to return the first row, it must first materialize the whole matching data set, compute the aggregate, and only then work out what the first row is.
You can read more about this in this earlier question;
https://dba.stackexchange.com/a/62444
If you were to run this same query without limit 1 on the end you're likely to find that it will take around the same sort of time to return the result.
As you mentioned in your comment, it would be best to look at the schema and work out how this query can be amended to be more efficient.
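For example, if the intent is "the row with the largest col2 for a given col3", a hypothetical rewrite (the composite index and its name are assumptions, not from the original schema) avoids aggregating over all ~268k matching rows:
-- Lets MySQL read a single entry from the end of the (col3, col2)
-- index instead of scanning every row with col3 = 123456.
ALTER TABLE some_table ADD INDEX col3_col2 (col3, col2);
SELECT col1, col2, col3, col4
FROM some_table
WHERE col3 = 123456
ORDER BY col2 DESC
LIMIT 1;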
EDIT 2: now that we have optimized the db, we have narrowed the remaining issue down in a follow-up question: MySQL - Why is phpMyAdmin extremely slow with this query that is super fast in php/mysqli?
EDIT 1: there are two solutions that helped us. One on database level (configuration) and one on query level. I could of course only accept one as the best answer, but if you are having similar problems, look at both.
We have a database that has been running perfectly fine for years. However, right now, we have a problem that I don't understand. Is it a mysql/InnoDB configuration problem? And we currently have nobody for system maintenance (I am a programmer).
The table TitelDaggegevens is a few gigs in size, about 12,000,000 records, so nothing extraordinary.
If we do:
SELECT *
FROM TitelDaggegevens
WHERE fondskosten IS NULL
AND (datum BETWEEN 20200401 AND 20200430)
it runs fine, within a few tenths of a second.
The result: 52 records.
Also if we add ORDER BY datum or if we order by any other non-indexed field: all is well, same speed.
However, if I add ORDER BY id (id being the primary key), suddenly the query takes 15 seconds for the same 52 records.
And when I ORDER BY another indexed field, the query time increases to 4-6 minutes. For ordering 52 records. On an indexed field.
I have no clue what is going on. EXPLAIN doesn't help me. I optimized/recreated the table, checked it, and restarted the server. All to no avail. I am absolutely no expert on configuring MySQL or InnoDB, so I have no clue where to start the search.
I am just hoping that maybe someone recognises this and can point me into the right direction.
SHOW TABLE STATUS WHERE Name = 'TitelDaggegevens'
Gives me:
I know this is a very vague problem, but I am not able to pin it down more specifically. I enabled the logging for slow queries but the table slow_log stays empty. I'm lost.
Thank you for any ideas where to look.
This might be a help to someone who knows something about it, but not really to me: phpMyAdmin's 'Advisor':
In the comments and in a reaction, I was asked for EXPLAIN outputs:
1) Without ORDER BY and with ORDER BY datum (which is in the WHERE and has an index):
2) With ORDER BY plus any field other than datum (indexed or not, so the same for both quick and slow queries).
The table structure:
CREATE TABLE `TitelDaggegevens` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`isbn` decimal(13,0) NOT NULL,
`datum` date NOT NULL,
`volgendeDatum` date DEFAULT NULL,
`prijs` decimal(8,2) DEFAULT NULL,
`prijsExclLaag` decimal(8,2) DEFAULT NULL,
`prijsExclHoog` decimal(8,2) DEFAULT NULL,
`stadiumDienstverlening` char(2) COLLATE utf8mb4_unicode_520_ci DEFAULT NULL,
`stadiumLevenscyclus` char(1) COLLATE utf8mb4_unicode_520_ci DEFAULT NULL,
`gewicht` double(7,3) DEFAULT NULL,
`volume` double(7,3) DEFAULT NULL,
`24uurs` tinyint(1) DEFAULT NULL,
`UitgeverCode` varchar(4) COLLATE utf8mb4_unicode_520_ci DEFAULT NULL,
`imprintId` int(11) DEFAULT NULL,
`distributievormId` tinyint(4) DEFAULT NULL,
`boeksoort` char(1) COLLATE utf8mb4_unicode_520_ci DEFAULT NULL,
`publishingStatus` tinyint(4) DEFAULT NULL,
`productAvailability` tinyint(4) DEFAULT NULL,
`voorraadAlles` mediumint(8) unsigned DEFAULT NULL,
`voorraadBeschikbaar` mediumint(8) unsigned DEFAULT NULL,
`voorraadGeblokkeerdEigenaar` smallint(5) unsigned DEFAULT NULL,
`voorraadGeblokkeerdCB` smallint(5) unsigned DEFAULT NULL,
`voorraadGereserveerd` smallint(5) unsigned DEFAULT NULL,
`fondskosten` enum('depot leverbaar','depot onleverbaar','POD','BOV','eBoek','geen') COLLATE utf8mb4_unicode_520_ci DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `ISBN+datum` (`isbn`,`datum`) USING BTREE,
KEY `UitgeverCode` (`UitgeverCode`),
KEY `Imprint` (`imprintId`),
KEY `VolgendeDatum` (`volgendeDatum`),
KEY `Index op voorraad om maxima snel te vinden` (`isbn`,`voorraadAlles`) USING BTREE,
KEY `fondskosten` (`fondskosten`),
KEY `Datum+isbn+fondskosten` (`datum`,`isbn`,`fondskosten`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=16519430 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_520_ci
Have this to handle the WHERE entirely:
INDEX(fondskosten, Datum)
Note: the = is first, then the range.
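Spelled out as a statement (the index name is arbitrary):
ALTER TABLE TitelDaggegevens ADD INDEX fondskosten_datum (fondskosten, datum);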
Fetch the *. Note: If there are big TEXT or BLOB columns that you don't need, spell out the SELECT list so you can avoid them. They may be stored "off-record", hence take longer to fetch.
An optional ORDER BY. If it is on Datum, then there is no extra effort. If it is on any other column, then there will be a sort. But a sort of 52 rows will be quite fast (milliseconds).
Notes:
If you don't have fondskosten IS NULL or you have some other test, then all bets are off. We have to start over in designing the optimal composite index.
USE/FORCE INDEX -- use this as a last resort.
Always provide SHOW CREATE TABLE when needing to discuss a query.
The Advisor has some good stuff, but without any clues of what is "too big", it is rather useless.
I suspect all the other discussions failed to realize that there are far more than 52 rows for the given Datum range. That is, fondskosten IS NULL is really part of both the problem and the solution.
For people searching for tweaks in similar cases, these are the tweaks the specialist made to the db that sped it up considerably (mind you, this is for a database with 100s of tables and MANY very complex and large queries, sometimes joining over 15 tables, but not a super massive number of records; the database is only 37 gigabytes).
[mysqld]
innodb_buffer_pool_size=2G
innodb_buffer_pool_instances=4
innodb_flush_log_at_trx_commit=2
tmp_table_size=64M
max_heap_table_size=64M
join_buffer_size=4M
sort_buffer_size=8M
optimizer_search_depth=5
The optimizer_search_depth was DECREASED to minimize the time the optimizer needs for the complex queries.
After restarting the server, (regularly) run all queries that are the result of running this query:
SELECT CONCAT('OPTIMIZE TABLE `', TABLE_SCHEMA , '`.`', TABLE_NAME ,'`;') AS query
FROM INFORMATION_SCHEMA.TABLES
WHERE DATA_FREE/DATA_LENGTH > 2 AND DATA_LENGTH > 4*1024*1024
(This first one is better run when the server is offline or under low use if you have large tables; it rebuilds, and thus optimizes, the tables that need it.)
And then:
SELECT CONCAT('ANALYZE TABLE `', TABLE_SCHEMA , '`.`', TABLE_NAME ,'`;') AS query
FROM INFORMATION_SCHEMA.TABLES
WHERE DATA_FREE/DATA_LENGTH > 2 AND DATA_LENGTH > 1*1024*1024
(This second query series is much lighter and less intrusive, but may still help speed up some queries by making the server recalculate its query strategies.)
Looks like ORDER BY uses 3 different optimization plans
ORDER BY id - Extra: Using index condition; Using where; Using filesort. MySQL uses filesort to resolve the ORDER BY, even though the rows are already sorted by id. So it takes 15 seconds.
ORDER BY Datum or another non-indexed field - Extra: Using index condition; Using where. MySQL uses the Datum index to resolve the ORDER BY. It takes a few tenths of a second.
ORDER BY index_field - Extra: Using index condition; Using where; Using filesort. MySQL uses filesort to resolve the ORDER BY, and the rows are unsorted. It takes a few minutes.
This is just my suggestion; only EXPLAIN can tell what's going on.
Influencing ORDER BY Optimization
UPD:
Could you check this query with each of the ORDER BY clauses?
SELECT *
FROM TitelDaggegevens USE INDEX FOR ORDER BY (Datum)
WHERE fondskosten IS NULL
AND (Datum BETWEEN 20200401 AND 20200430)
Also, you may try increasing the sort_buffer_size.
If you see many Sort_merge_passes per second in SHOW GLOBAL STATUS output, you can consider increasing the sort_buffer_size value to speed up ORDER BY or GROUP BY operations that cannot be improved with query optimization or improved indexing.
On Linux, there are thresholds of 256KB and 2MB where larger values may significantly slow down memory allocation, so you should consider staying below one of those values.
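A sketch of that check and a session-scoped experiment (2MB here is just an example value, chosen with the thresholds above in mind):
-- How often sorts had to spill to disk and merge:
SHOW GLOBAL STATUS LIKE 'Sort_merge_passes';
-- Try a larger buffer for this session only, then re-run the query:
SET SESSION sort_buffer_size = 2 * 1024 * 1024;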
Forgive me for asking what should be a simple question but I am totally new to Sphinx.
I am using Sphinx with a MySQL datastore. The table looks like below, with the Title and Content fields indexed by Sphinx.
CREATE TABLE `documents` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`group_id` int(11) NOT NULL,
`group_id2` int(11) NOT NULL,
`date_added` datetime NOT NULL,
`title` varchar(255) NOT NULL,
`content` text NOT NULL,
`url` varchar(255) NOT NULL,
`links` int(11) NOT NULL,
`hosts` int(11) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `url` (`url`)
) ENGINE=InnoDB AUTO_INCREMENT=439043 DEFAULT CHARSET=latin1
Now, if I connect through Sphinx with
mysql -h0 -P9306
I can run a simple query like...
SELECT * FROM test1 WHERE MATCH('test document');
And I will get back a result set like...
+--------+----------+------------+
| id | group_id | date_added |
+--------+----------+------------+
| 360625 | 1 | 1499727792 |
| 362257 | 1 | 1499727807 |
| 362777 | 1 | 1499727811 |
| 159717 | 1 | 1499717614 |
| 160557 | 1 | 1499717621 |
+--------+----------+------------+
What I actually want is for it to return a result set with column values from the documents table (like the URL, Title, Links, Hosts, etc. columns) and, if at all possible, sorted by the relevancy of the Sphinx match.
Can that be accomplished in a single query? What might it look like?
Thanks in advance!
Two (main) options
Take the ids from the SphinxQL result, and run a MySQL Query to get the full details, see http://sphinxsearch.com/info/faq/#row-storage
eg SELECT * FROM documents WHERE id IN (3,5,7) ORDER BY FIELD(id,3,5,7)
This MySQL query should be VERY quick, because it's a PK lookup retrieving only a few rows (i.e. one page of results) - the heavy lifting of searching the whole table has already been done in the first Sphinx query.
Duplicate all the columns you want to retrieve in the result set as attributes. You've already made group_id and date_added attributes; you would need to make more.
sql_field_string is a very convenient shortcut to make BOTH a field and a string attribute from one column. It's not available for other column types, but that matters less, as numeric columns are not typically needed as fields anyway.
Option 1 is good in that it avoids duplicating the data and saves memory (Sphinx typically wants to hold attributes in memory), and may be most practical on big datasets.
Option 2 is good in that it avoids a second query for each result. But because you hold a copy of the data, it may mean extra complication keeping the copies in sync.
This doesn't look relevant in your case, but say you had a 'clicks' column that you want to increment often (when users click!) and need in the result set, but don't really need in Sphinx for query purposes: the first option would let you increment it in the database only, and the MySQL query would always get the live value, whereas the second option means having to keep the Sphinx index in sync at all times.
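As a sketch of what option 2 buys you at query time: assuming url, title, links and hosts have been added as attributes in the index (e.g. via sql_field_string for the string columns), one SphinxQL query returns everything, sorted by relevance. WEIGHT() is the relevance score in SphinxQL 2.1+; older versions used @weight:
SELECT id, title, url, links, hosts, WEIGHT()
FROM test1
WHERE MATCH('test document')
ORDER BY WEIGHT() DESC
LIMIT 20;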
I'm trying to populate some data for a table. The query is being run on a table that contains ~50 million records. The query I'm currently using is below. It counts the number of rows that match the template id and are BETWEEN two unix timestamps:
SELECT COUNT(*) as count FROM `s_log`
WHERE `time_sent` BETWEEN '1346904000' AND '1346993271'
AND `template` = '1'
While the query above does work, performance is rather slow when looping through each template, of which there can at times be hundreds. The timestamps are stored as int and are properly indexed. Just to test things out, I tried running the query below, omitting the time_sent restriction:
SELECT COUNT(*) as count FROM `s_log`
WHERE `template` = '1'
As expected, it runs very fast, but it is obviously not restricting the count to the correct time frame. How can I obtain a count for a specific template AND restrict that count BETWEEN two unix timestamps?
EXPLAIN:
+----+-------------+-------+------+--------------------+----------+---------+-------+-------+-------------+
| id | select_type | table | type | possible_keys      | key      | key_len | ref   | rows  | Extra       |
+----+-------------+-------+------+--------------------+----------+---------+-------+-------+-------------+
|  1 | SIMPLE      | s_log | ref  | time_sent,template | template | 4       | const | 71925 | Using where |
+----+-------------+-------+------+--------------------+----------+---------+-------+-------+-------------+
SHOW CREATE TABLE s_log:
CREATE TABLE `s_log` (
`id` int(255) NOT NULL AUTO_INCREMENT,
`email` varchar(255) NOT NULL,
`time_sent` int(25) NOT NULL,
`template` int(55) NOT NULL,
`key` varchar(255) NOT NULL,
`node_id` int(55) NOT NULL,
`status` varchar(55) NOT NULL,
PRIMARY KEY (`id`),
KEY `email` (`email`),
KEY `time_sent` (`time_sent`),
KEY `template` (`template`),
KEY `node_id` (`node_id`),
KEY `key` (`key`),
KEY `status` (`status`),
KEY `timestamp` (`timestamp`)
) ENGINE=MyISAM AUTO_INCREMENT=2078966 DEFAULT CHARSET=latin1
The best index you can have in this case is a composite one: template + time_sent
CREATE INDEX template_time_sent ON s_log (template, time_sent)
PS: Also, as long as all the columns in your query are integers, DON'T enclose their values in quotes (in some cases it could lead to issues, at least with older MySQL versions).
First, you have to create an index that has both of your columns together (not separately). Also check your table type; I think it would work great if your table were InnoDB.
And lastly, use your WHERE clause in this fashion:
WHERE `template` = '1' AND `time_sent` BETWEEN '1346904000' AND '1346993271'
What this does is first check whether template is 1; if it is, it then checks the second condition, otherwise it skips the row. This will definitely give you a performance edge.
If you have to call the query for each template maybe it would be faster to get all the information with one query call by using GROUP BY:
SELECT template, COUNT(*) as count FROM `s_log`
WHERE `time_sent` BETWEEN 1346904000 AND 1346993271
GROUP BY template;
It's just a guess that this would be faster and you also would have to redesign your code a bit.
You could also try using InnoDB instead of MyISAM. InnoDB uses a clustered index, which may perform better on large tables. From the MySQL site:
Accessing a row through the clustered index is fast because the row data is on the same page where the index search leads. If a table is large, the clustered index architecture often saves a disk I/O operation when compared to storage organizations that store row data using a different page from the index record. (For example, MyISAM uses one file for data rows and another for index records.)
There are some questions on Stackoverflow which discuss the performance between InnoDB and MyISAM:
Should I use MyISAM or InnoDB Tables for my MySQL Database?
Migrating from MyISAM to InnoDB
MyISAM versus InnoDB
I'm trying to figure out how to optimize a very slow query in MySQL (I didn't design this):
SELECT COUNT(*) FROM change_event me WHERE change_event_id > '1212281603783391';
+----------+
| COUNT(*) |
+----------+
| 3224022 |
+----------+
1 row in set (1 min 0.16 sec)
Comparing that to a full count:
select count(*) from change_event;
+----------+
| count(*) |
+----------+
| 6069102 |
+----------+
1 row in set (4.21 sec)
The explain statement doesn't help me here:
explain SELECT COUNT(*) FROM change_event me WHERE change_event_id > '1212281603783391'\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: me
type: range
possible_keys: PRIMARY
key: PRIMARY
key_len: 8
ref: NULL
rows: 4120213
Extra: Using where; Using index
1 row in set (0.00 sec)
OK, it still thinks it needs roughly 4 million entries to count, but I could count lines in a file faster than that! I don't understand why MySQL is taking this long.
Here's the table definition:
CREATE TABLE `change_event` (
`change_event_id` bigint(20) NOT NULL default '0',
`timestamp` datetime NOT NULL,
`change_type` enum('create','update','delete','noop') default NULL,
`changed_object_type` enum('Brand','Broadcast','Episode','OnDemand') NOT NULL,
`changed_object_id` varchar(255) default NULL,
`changed_object_modified` datetime NOT NULL default '1000-01-01 00:00:00',
`modified` datetime NOT NULL default '1000-01-01 00:00:00',
`created` datetime NOT NULL default '1000-01-01 00:00:00',
`pid` char(15) default NULL,
`episode_pid` char(15) default NULL,
`import_id` int(11) NOT NULL,
`status` enum('success','failure') NOT NULL,
`xml_diff` text,
`node_digest` char(32) default NULL,
PRIMARY KEY (`change_event_id`),
KEY `idx_change_events_changed_object_id` (`changed_object_id`),
KEY `idx_change_events_episode_pid` (`episode_pid`),
KEY `fk_import_id` (`import_id`),
KEY `idx_change_event_timestamp_ce_id` (`timestamp`,`change_event_id`),
KEY `idx_change_event_status` (`status`),
CONSTRAINT `fk_change_event_import` FOREIGN KEY (`import_id`) REFERENCES `import` (`import_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
Version:
$ mysql --version
mysql Ver 14.12 Distrib 5.0.37, for pc-solaris2.8 (i386) using readline 5.0
Is there something obvious I'm missing? (Yes, I've already tried "SELECT COUNT(change_event_id)", but there's no performance difference).
InnoDB uses clustered primary keys, so the primary key is stored along with the row in the data pages, not in separate index pages. In order to do a range scan you still have to scan through all of the potentially wide rows in data pages; note that this table contains a TEXT column.
Two things I would try:
Run OPTIMIZE TABLE. This will ensure that the data pages are physically stored in sorted order. This could conceivably speed up a range scan on a clustered primary key.
Create an additional non-primary index on just the change_event_id column. This will store a copy of that column in index pages, which will be much faster to scan. After creating it, check the explain plan to make sure it's using the new index.
(you also probably want to make the change_event_id column bigint unsigned if it's incrementing from zero)
Here are a few things I suggest:
Change the column from a "bigint" to an "int unsigned". Do you really expect ever to have more than 4.2 billion records in this table? If not, then you're wasting space (and time) with the extra-wide field. MySQL indexes are more efficient on smaller data types.
Run the "OPTIMIZE TABLE" command, and see whether your query is any faster afterward.
You might also consider partitioning your table according to the ID field, especially if older records (with lower ID values) become less relevant over time. A partitioned table can often execute aggregate queries faster than one huge, unpartitioned table.
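A hedged sketch of the partitioning idea (requires MySQL 5.1+, newer than the 5.0.37 shown above; the boundary value is made up for illustration). Note that partitioned tables cannot have foreign keys, so fk_change_event_import would have to be dropped first, and the partitioning column must appear in every unique key, which holds here since change_event_id is the primary key:
ALTER TABLE change_event DROP FOREIGN KEY fk_change_event_import;
ALTER TABLE change_event
PARTITION BY RANGE (change_event_id) (
  PARTITION p_old VALUES LESS THAN (1212281603783391),
  PARTITION p_new VALUES LESS THAN MAXVALUE
);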
EDIT:
Looking more closely at this table, it looks like a logging-style table, where rows are inserted but never modified.
If that's true, then you might not need all the transactional safety provided by the InnoDB storage engine, and you might be able to get away with switching to MyISAM, which is considerably more efficient on aggregate queries.
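If you do go this way, the switch itself is a one-liner, though it rewrites the whole table, so expect it to take a while on 6 million rows; as with partitioning above, the foreign key has to go first because MyISAM doesn't support them:
ALTER TABLE change_event DROP FOREIGN KEY fk_change_event_import;
ALTER TABLE change_event ENGINE=MyISAM;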
I've run into behavior like this before with IP geolocation databases. Past some number of records, MySQL's ability to get any advantage from indexes for range-based queries apparently evaporates. With the geolocation DBs, we handled it by segmenting the data into chunks that were reasonable enough to allow the indexes to be used.
Check to see how fragmented your indexes are. At my company we have a nightly import process that trashes our indexes, and over time it can have a profound impact on data access speeds. For example, we had a SQL procedure that took 2 hours to run; one day after de-fragmenting the indexes, it took 3 minutes. We use SQL Server 2005; I'll look for a script that can check this on MySQL.
Update: Check out this link: http://dev.mysql.com/doc/refman/5.0/en/innodb-file-defragmenting.html
Run "analyze table_name" on that table - it's possible that the indices are no longer optimal.
You can often tell this by running "show index from table_name". If the cardinality value is NULL then you need to force re-analysis.
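Concretely, for the table in the question:
ANALYZE TABLE change_event;
-- NULL in the Cardinality column means the statistics are missing
-- and a re-analysis is needed:
SHOW INDEX FROM change_event;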
MySQL does say "Using where" first, since it does need to read all records/values from the index data to actually count them. With InnoDB it also tries to "grab" that 4 million record range to count it.
You may need to experiment with different transaction isolation levels: http://dev.mysql.com/doc/refman/5.1/en/set-transaction.html#isolevel_read-uncommitted
and see which one is better.
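A sketch of that experiment (READ UNCOMMITTED trades consistency for speed, which is often acceptable for a monitoring-style count):
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT COUNT(*) FROM change_event WHERE change_event_id > 1212281603783391;
-- Restore the default afterwards:
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;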
With MyISAM it would just be fast, but with a write-intensive model it will result in lock issues.
To make the search more efficient, I recommend adding an index. I leave the command here so you can try the metrics again:
CREATE INDEX ixid_1 ON change_event (change_event_id);
and repeat query
SELECT COUNT(*) FROM change_event me WHERE change_event_id > '1212281603783391';
I would create a "counters" table and add "create row"/"delete row" triggers to the table you are counting. The triggers should increase/decrease count values on "counters" table on every insert/delete, so you won't need to compute them every time you need them.
You can also accomplish this on the application side by caching the counters but this will involve clearing the "counter cache" on every insertion/deletion.
For some reference take a look at this http://pure.rednoize.com/2007/04/03/mysql-performance-use-counter-tables/
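A minimal sketch of the trigger-maintained counter (table and trigger names are illustrative; this tracks only the overall row count, so per-range statistics would need a richer counters table):
-- Seed the counter once, then let the triggers keep it current.
CREATE TABLE change_event_counter (
  total BIGINT UNSIGNED NOT NULL
);
INSERT INTO change_event_counter (total)
SELECT COUNT(*) FROM change_event;
CREATE TRIGGER change_event_ai AFTER INSERT ON change_event
FOR EACH ROW UPDATE change_event_counter SET total = total + 1;
CREATE TRIGGER change_event_ad AFTER DELETE ON change_event
FOR EACH ROW UPDATE change_event_counter SET total = total - 1;
-- Reading the count is now O(1):
SELECT total FROM change_event_counter;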