I have a MariaDB table with just under 100000 rows, and selecting the count takes a very long time (almost 2 minutes).
Selecting anything by id from the table though takes only 4 milliseconds.
The text field here contains on average 5000 characters.
How can I speed this up?
MariaDB [companies]> describe company_details;
+---------+------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------+------------------+------+-----+---------+-------+
| id | int(10) unsigned | NO | PRI | NULL | |
| details | text | YES | | NULL | |
+---------+------------------+------+-----+---------+-------+
MariaDB [companies]> explain select count(id) from company_details;
+------+-------------+-----------------+-------+---------------+---------+---------+------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-----------------+-------+---------------+---------+---------+------+-------+-------------+
| 1 | SIMPLE | company_details | index | NULL | PRIMARY | 4 | NULL | 71267 | Using index |
+------+-------------+-----------------+-------+---------------+---------+---------+------+-------+-------------+
MariaDB [companies]> analyze table company_details;
+---------------------------+---------+----------+----------+
| Table | Op | Msg_type | Msg_text |
+---------------------------+---------+----------+----------+
| companies.company_details | analyze | status | OK |
+---------------------------+---------+----------+----------+
1 row in set (0.098 sec)
MariaDB [companies]> select count(id) from company_details;
+-----------+
| count(id) |
+-----------+
| 96544 |
+-----------+
1 row in set (1 min 43.199 sec)
This becomes an even bigger problem when I try to join the table.
For example, to find the number of companies which do not have associated details:
MariaDB [companies]> SELECT COUNT(*) FROM company c LEFT JOIN company_details cd ON c.id = cd.id WHERE cd.id IS NULL;
+----------+
| count(*) |
+----------+
| 42178 |
+----------+
1 row in set (10 min 28.846 sec)
Edit:
After running OPTIMIZE on the table, the select count has improved speed from 1min 43sec to just 5 sec, and the join has improved speed from 10 minutes to 25 seconds.
MariaDB [companies]> optimize table company_details;
+---------------------------+----------+----------+-------------------------------------------------------------------+
| Table | Op | Msg_type | Msg_text |
+---------------------------+----------+----------+-------------------------------------------------------------------+
| companies.company_details | optimize | note | Table does not support optimize, doing recreate + analyze instead |
| companies.company_details | optimize | status | OK |
+---------------------------+----------+----------+-------------------------------------------------------------------+
2 rows in set (11 min 21.195 sec)
I think there's OPTIMIZE -command for rebuilding indexes.
OPTIMIZE company_details;
This usually takes some time to complete. More details: https://mariadb.com/kb/en/optimize-table/
Related
Background
I made a small table of 10 rows from a previous SELECT already ran (SavedAnimals).
I have a massive table (animals) which I would like to UPDATE using the rows with the same id as each row in my new table.
What I have tried so far
I can quickly SELECT the desired rows from the big table like this:
mysql> EXPLAIN SELECT * FROM animals WHERE ignored=0 and id IN (SELECT animal_id FROM SavedAnimals);
+------+--------------+-------------------------------+--------+---------------+---------+---------+----------------------------------------------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+--------------+-------------------------------+--------+---------------+---------+---------+----------------------------------------------------------+------+-------------+
| 1 | PRIMARY | <subquery2> | ALL | distinct_key | NULL | NULL | NULL | 10 | |
| 1 | PRIMARY | animals | eq_ref | PRIMARY | PRIMARY | 8 | db_staging.SavedAnimals.animal_id | 1 | Using where |
| 2 | MATERIALIZED | SavedAnimals | ALL | NULL | NULL | NULL | NULL | 10 | |
+------+--------------+-------------------------------+--------+---------------+---------+---------+----------------------------------------------------------+------+-------------+
But the "same" command on the UPDATE is not quick:
mysql> EXPLAIN UPDATE animals SET ignored=1, ignored_when=CURRENT_TIMESTAMP WHERE ignored=0 and id IN (SELECT animal_id FROM SavedAnimals);
+------+--------------------+-------------------------------+-------+---------------+---------+---------+------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+--------------------+-------------------------------+-------+---------------+---------+---------+------+----------+-------------+
| 1 | PRIMARY | animals | index | NULL | PRIMARY | 8 | NULL | 34269464 | Using where |
| 2 | DEPENDENT SUBQUERY | SavedAnimals | ALL | NULL | NULL | NULL | NULL | 10 | Using where |
+------+--------------------+-------------------------------+-------+---------------+---------+---------+------+----------+-------------+
2 rows in set (0.00 sec)
The UPDATE command never finishes if I run it.
QUESTION
How do I make mariaDB run with the Materialized select_type on the UPDATE like it does on the SELECT?
OR
Is there a totally separate way that I should approach this which would be quick?
Notes
Version: 10.3.23-MariaDB-log
Use JOIN rather than WHERE...IN. MySQL tends to optimize them better.
UPDATE animals AS a
JOIN SavedAnimals AS sa ON a.id = sa.animal_id
SET a.ignored=1, a.ignored_when=CURRENT_TIMESTAMP
WHERE a.ignored = 0
You should find an EXISTS clause more efficient than an IN clause. For example:
UPDATE animals a
SET a.ignored = 1,
a.ignored_when = CURRENT_TIMESTAMP
WHERE a.ignored = 0
AND EXISTS (SELECT * FROM SavedAnimals sa WHERE sa.animal_id = a.id)
I was trying to improve performance on some queries through indexes using EXPLAIN and I noticed each time I used SHOW index FROM TableB; the output of the rows colums in the EXPLAIN of a query changed
Ex:
mysql> EXPLAIN Select A.id
From TableA A
Inner join TableB B
On A.address = B.address And A.code = B.code
Group by A.id
Having count(distinct B.id) = 1;
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------------+-------+----------------------------------------------+
| 1 | SIMPLE | B | index | test_index | PRIMARY | 518 | NULL | 10561 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | A | eq_ref | PRIMARY | PRIMARY | 514 | db.B.address,db.B.code | 1 | |
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------------+-------+----------------------------------------------+
2 rows in set (0.00 sec)
mysql> show index from TableB;
+-----------+------------+--------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-----------+------------+--------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| TableB | 0 | PRIMARY | 1 | id | A | 7 | NULL | NULL | | BTREE | |
| TableB | 0 | PRIMARY | 2 | address | A | 21 | NULL | NULL | | BTREE | |
| TableB | 0 | PRIMARY | 3 | code | A | 10402 | NULL | NULL | | BTREE | |
| TableB | 1 | test_index | 1 | address | A | 1 | NULL | NULL | | BTREE | |
| TableB | 1 | test_index | 2 | code | A | 10402 | NULL | NULL | | BTREE | |
| TableB | 1 | test_index | 3 | id | A | 10402 | NULL | NULL | | BTREE | |
+-----------+------------+--------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
6 rows in set (0.03 sec)
and...
mysql> EXPLAIN Select A.id
From TableA A
Inner join TableB B
On A.address = B.address And A.code = B.code Group by A.id
Having count(distinct B.id) = 1;
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------------+-------+----------------------------------------------+
| 1 | SIMPLE | B | index | test_index | PRIMARY | 518 | NULL | 9800 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | A | eq_ref | PRIMARY | PRIMARY | 514 | db.B.address,db.B.code | 1 | |
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------------+-------+----------------------------------------------+
2 rows in set (0.00 sec)
Why does this happen?
The rows column should be taken as a rough estimate only. It's not a precise number.
It's based on statistical estimates of how many rows will be examined during a query. The actual number of rows cannot be known until you actually execute the query.
The statistics are based on samples read from the table periodically. These samples are re-read occasionally, for example after you run ANALYZE TABLE or certain INFORMATION_SCHEMA queries, or certain SHOW statements.
I don't find 20% variation in statistics to be a big deal. In many situations, think of the graph being like an upturned parabola, and you need to know which side of the minimum point you are on. In complex queries, where the Optimizer is likely to goof, it need a lot more than simple stats, such as Histograms of MariaDB 10.0 / 10.1. (I don't have enough experience with such to say whether that makes much headway.)
Your particular query is probably going to be performed in only one way, regardless of the statistics. An example of a complicated query would be a JOIN with WHERE clauses filtering each table. The optimizer has to decide which table to start with. Another case is a single table with a WHERE and ORDER BY and they cannot both be handled by a single index -- should it use an index to filter, but then have to sort? or should it use an index for ORDER BY, but then have to filter on the fly?
I have table as
+-------------------+----------------+------+-----+---------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+-------------------+----------------+------+-----+---------------------+-----------------------------+
| id | bigint(20) | NO | PRI | NULL | auto_increment |
| runtime_id | bigint(20) | NO | MUL | NULL | |
| place_id | bigint(20) | NO | MUL | NULL | |
| amended_timestamp | varchar(50) | YES | | NULL | |
| applicable_at | timestamp | NO | | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
| schedule_time | timestamp | NO | MUL | 0000-00-00 00:00:00 | |
| quality_indicator | varchar(10) | NO | | NULL | |
| flow_rate | decimal(15,10) | NO | | NULL | |
+-------------------+----------------+------+-----+---------------------+-----------------------------+
I have index on schedule_time as
create index table_index on table(schedule_time asc);
The table currently has 2121552+ records.
The thing I fail to understand is when I do explain
explain select runtime_id from table where schedule_time >= now() - INTERVAL 1 DAY;
+----+-------------+----------+-------+------------------------------+------------------------------+---------+------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+-------+------------------------------+------------------------------+---------+------+-------+-------------+
| 1 | SIMPLE | table | range | table_index | table_index | 4 | NULL | 38088 | Using where |
+----+-------------+----------+-------+------------------------------+------------------------------+---------+------+-------+-------------+
1 row in set (0.00 sec)
Above index is used, but the below one not.
mysql> explain select runtime_id from table where schedule_time >= now() - INTERVAL 30 DAY;
+----+-------------+----------+------+------------------------------+------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+------+------------------------------+------+---------+------+---------+-------------+
| 1 | SIMPLE | table | ALL | table_index | NULL | NULL | NULL | 2118107 | Using where |
+----+-------------+----------+------+------------------------------+------+---------+------+---------+-------------+
1 row in set (0.00 sec)
I'll really appreciate if someone can point out whats wrong here, as the data is updated every 12 minutes and as the time passes by query for 30 days or may be 60 days will get very slow.
The final query where I plan to use it is as follows
select avg(flow_rate),c.group from table a ,(select runtime_id from table where schedule_time >= now() - INTERVAL 1 DAY group by schedule_time ) b,place c where a.runtime_id = b.runtime_id and a.place_id = c.id group by c.group;
Update =====>
As per the comments between fails too.
mysql> explain select runtime_id from table where schedule_time between '2013-07-17 12:48:00' and '2013-08-17 12:48:00';
+----+-------------+----------+------+------------------------------+------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+------+------------------------------+------+---------+------+---------+-------------+
| 1 | SIMPLE | table | ALL | table_index | NULL | NULL | NULL | 2118431 | Using where |
+----+-------------+----------+------+------------------------------+------+---------+------+---------+-------------+
1 row in set (0.00 sec)
mysql> explain select runtime_id from table where schedule_time between '2013-08-16 12:48:00' and '2013-08-17 12:48:00';
+----+-------------+----------+-------+------------------------------+------------------------------+---------+------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+-------+------------------------------+------------------------------+---------+------+-------+-------------+
| 1 | SIMPLE | table | range | table_index | table_index | 4 | NULL | 38770 | Using where |
+----+-------------+----------+-------+------------------------------+------------------------------+---------+------+-------+-------------+
1 row in set (0.00 sec)
Update 2 =======>
mysql> select count(*) from table where schedule_time between '2013-08-16 12:48:00' and '2013-08-17 12:48:00';
+----------+
| count(*) |
+----------+
| 19440 |
+----------+
1 row in set (0.01 sec)
mysql> select count(*) from table where schedule_time between '2013-07-17 12:48:00' and '2013-08-17 12:48:00';
+----------+
| count(*) |
+----------+
| 597132 |
+----------+
1 row in set (0.00 sec)
Server version: 5.5.24-0ubuntu0.12.04.1 (Ubuntu)
The MySQL optimizer tries to do the fastest thing. Where it thinks that using the index will take as long or longer than doing a table scan, it abandons the available index.
This is what you see it doing in your examples:
where the range is small (1 day) the index will be faster;
where the range is large, you're going to be hitting so much more of the table you might as well scan the table directly (remember, using the index involves searching the index and then grabbing the indexed records from the table - two sets of seeks).
If you think you know better than the optimizer (it isn't perfect), use INDEX hints:
The USE INDEX (index_list) hint tells MySQL to use only one of the
named indexes to find rows in the table. The alternative syntax IGNORE
INDEX (index_list) tells MySQL to not use some particular index or
indexes. These hints are useful if EXPLAIN shows that MySQL is using
the wrong index from the list of possible indexes.
I think that my question can be solved by just knowing how, for example, stackoverflow works.
For example, this page, loads in a few ms (< 300ms):
https://stackoverflow.com/questions?page=61440&sort=newest
The only query i can think about for that page is something like SELECT * FROM stuff ORDER BY date DESC LIMIT {pageNumber}*{stuffPerPage}, {pageNumber}*{stuffPerPage}+{stuffPerPage}
A query like that might take several seconds to run, but the stack overflow page loads almost in no time. It can't be a cached query, since that question are posted over time and rebuild the cache every time a question is posted is simply madness.
So, how do this works in your opinion?
(to make the question easier, let's forget about the ORDER BY)
Example (the table is fully cached in ram and stored in an ssd drive)
mysql> select * from thread limit 1000000, 1;
1 row in set (1.61 sec)
mysql> select * from thread limit 10000000, 1;
1 row in set (16.75 sec)
mysql> describe select * from thread limit 1000000, 1;
+----+-------------+--------+------+---------------+------+---------+------+----------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+---------------+------+---------+------+----------+-------+
| 1 | SIMPLE | thread | ALL | NULL | NULL | NULL | NULL | 64801163 | |
+----+-------------+--------+------+---------------+------+---------+------+----------+-------+
mysql> select * from thread ORDER BY thread_date DESC limit 1000000, 1;
1 row in set (1 min 37.56 sec)
mysql> SHOW INDEXES FROM thread;
+--------+------------+----------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+--------+------------+----------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| thread | 0 | PRIMARY | 1 | newsgroup_id | A | 102924 | NULL | NULL | | BTREE | | |
| thread | 0 | PRIMARY | 2 | thread_id | A | 47036298 | NULL | NULL | | BTREE | | |
| thread | 0 | PRIMARY | 3 | postcount | A | 47036298 | NULL | NULL | | BTREE | | |
| thread | 0 | PRIMARY | 4 | thread_date | A | 47036298 | NULL | NULL | | BTREE | | |
| thread | 1 | date | 1 | thread_date | A | 47036298 | NULL | NULL | | BTREE | | |
+--------+------------+----------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
5 rows in set (0.00 sec)
Create a BTREE index on date column and the query will run in a breeze.
CREATE INDEX date ON stuff(date) USING BTREE
UPDATE: Here is a test I just did:
CREATE TABLE test( d DATE, i INT, INDEX(d) );
Filled the table with 2,000,000 rows with different unique is and ds
mysql> SELECT * FROM test LIMIT 1000000, 1;
+------------+---------+
| d | i |
+------------+---------+
| 1897-07-22 | 1000000 |
+------------+---------+
1 row in set (0.66 sec)
mysql> SELECT * FROM test ORDER BY d LIMIT 1000000, 1;
+------------+--------+
| d | i |
+------------+--------+
| 1897-07-22 | 999980 |
+------------+--------+
1 row in set (1.68 sec)
And here is an interesiting observation:
mysql> EXPLAIN SELECT * FROM test ORDER BY d LIMIT 1000, 1;
+----+-------------+-------+-------+---------------+------+---------+------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------+
| 1 | SIMPLE | test | index | NULL | d | 4 | NULL | 1001 | |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------+
mysql> EXPLAIN SELECT * FROM test ORDER BY d LIMIT 10000, 1;
+----+-------------+-------+------+---------------+------+---------+------+---------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+---------+----------------+
| 1 | SIMPLE | test | ALL | NULL | NULL | NULL | NULL | 2000343 | Using filesort |
+----+-------------+-------+------+---------------+------+---------+------+---------+----------------+
MySql does use the index for OFFSET 1000 but not for 10000.
Even more interesting, if I do FORCE INDEX query takes more time:
mysql> SELECT * FROM test FORCE INDEX(d) ORDER BY d LIMIT 1000000, 1;
+------------+--------+
| d | i |
+------------+--------+
| 1897-07-22 | 999980 |
+------------+--------+
1 row in set (2.21 sec)
I think StackOverflow doesn't need to reach the rows at offset 10000000. The query below should be fast enough if you have an index on date and the numbers in LIMIT clause are from real world examples, not millions :)
SELECT *
FROM stuff
ORDER BY date DESC
LIMIT {pageNumber}*{stuffPerPage}, {stuffPerPage}
UPDATE:
If records in a table are relatively rarely deleted (like in StackOverflow) then you can use the following solution:
SELECT *
FROM stuff
WHERE id between
{stuffCount}-{pageNumber}*{stuffPerPage}+1 AND
{stuffCount}-{pageNumber-1}*{stuffPerPage}
ORDER BY id DESC
Where {stuffCount} is:
SELECT MAX(id) FROM stuff
If you have some deleted records in a database then some pages will have less than {stuffPerPage} records, but it should not be the problem. StackOverflow uses some inaccurate algorithm too. For instance try to go to the first page and to the last page and you'll see that both pages return 30 records per page. But mathematically it's nonsense.
Solutions designed to work with large databases often uses some hacks which usually are unnoticeable for regular users.
Nowadays paging with millions of records is not modish, because it's impractical. Currently it's popular to use infinite scrolling (automatic or manual with button click). It has more sense and pages load faster because they don't need to be reloaded. If you think that old records can be useful for your users too, then it's a good idea to create a page with random records (with infinite scrolling too). This was my opinion :)
I have a MySQL 5.0 database with a few tables containing over 50M rows. But how do I know this? By running "SELECT COUNT(1) FROM foo", of course. This query on one table containing 58.8M rows took 10 minutes to complete!
mysql> SELECT COUNT(1) FROM large_table;
+----------+
| count(1) |
+----------+
| 58778494 |
+----------+
1 row in set (10 min 23.88 sec)
mysql> EXPLAIN SELECT COUNT(1) FROM large_table;
+----+-------------+-------------------+-------+---------------+----------------------------------------+---------+------+-----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------------+-------+---------------+----------------------------------------+---------+------+-----------+-------------+
| 1 | SIMPLE | large_table | index | NULL | fk_large_table_other_table_id | 5 | NULL | 167567567 | Using index |
+----+-------------+-------------------+-------+---------------+----------------------------------------+---------+------+-----------+-------------+
1 row in set (0.00 sec)
mysql> DESC large_table;
+-------------------+---------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------------+---------------------+------+-----+---------+----------------+
| id | bigint(20) unsigned | NO | PRI | NULL | auto_increment |
| created_on | datetime | YES | | NULL | |
| updated_on | datetime | YES | | NULL | |
| other_table_id | int(11) | YES | MUL | NULL | |
| parent_id | bigint(20) unsigned | YES | MUL | NULL | |
| name | varchar(255) | YES | | NULL | |
| property_type | varchar(64) | YES | | NULL | |
+-------------------+---------------------+------+-----+---------+----------------+
7 rows in set (0.00 sec)
All of the tables in question are InnoDB.
Any ideas why this is so slow, and how I can speed it up?
Counting all the rows in a table is a very slow operation; you can't really speed it up, unless you are prepared to keep a count somewhere else (and of course, that can become out of sync).
People who are used to MyISAM tend to think that they get count(*) "for free", but it's not really. MyISAM cheats by not having MVCC, which makes it fairly easy.
The query you're showing is doing a full index scan of a not-null index, which is generally the fastest way of counting the rows in an innodb table.
It is difficult to guess from the information you've given, what your application is, but in general, it's ok for users (etc) to see close approximations of the number of rows in large tables.
If you need to have the result instantly and you don't care if it's 58.8M or 51.7M, you can find out the approximate number of rows by calling
show table status like 'large_table';
See the column rows
For more information about the result take a look at the manual at http://dev.mysql.com/doc/refman/5.1/en/show-table-status.html
select count(id) from large_table will surely run faster