Optimizing MySQL search query

I need your help optimizing a MySQL query. Let's take a simple table as an example.
CREATE TABLE `Modules` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`moduleName` varchar(100) NOT NULL,
`menuName` varchar(255) NOT NULL,
PRIMARY KEY (`ID`),
KEY `moduleName` (`moduleName`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
Let's fill it with some data:
INSERT INTO `Modules` (`moduleName` ,`menuName`)
VALUES
('abc1', 'name1'),
('abc', 'name2'),
('ddf', 'name3'),
('ccc', 'name4'),
('fer', 'name5');
And take some sample string; let it be abc_def.
Traditionally, one searches for all the rows that contain the search string.
My task is the opposite: to find all rows whose moduleName is a prefix of the input string. For now I have the following query to get the desired result:
SELECT `moduleName` ,`menuName`
FROM `Modules`
WHERE 'abc_def' LIKE(CONCAT(`moduleName`,'%'))
This will return
moduleName | menuName
---------------------------
abc | name2
The problem is that this query is not using the index.
Is there some way to force it to use one?

You seem to misunderstand what an index is and how it can help to speed up a query.
Let's look at what your moduleName index is. It is basically a sorted list of mappings from moduleName values to ID values. And what are you selecting?
SELECT moduleName, menuName
FROM Modules
WHERE 'abc_def' LIKE CONCAT(moduleName,'%');
That is, you want two fields for each row whose transformed moduleName value has some relation to a constant. How could an index possibly help you? There is no exact match, and there is no way to take advantage of the fact that the moduleName values are sorted.
What you need in order to take advantage of the index is a check for an exact match in the condition:
SELECT moduleName, menuName
FROM Modules
WHERE moduleName = LEFT('abc_def', LENGTH(moduleName));
Now we do have an exact match, but since the right part of the condition depends on moduleName as well, the condition must be checked for each row. In this case MySQL cannot predict how many rows will match, but it can predict that it would need random disk access to fetch the menuName for each matching row, so MySQL will not use the index.
So you basically have two approaches (both sketched below):
if you know that the condition narrows the number of matching rows significantly, you can simply force the index;
alternatively, you can extend your index to a covering composite index (moduleName, menuName), so that all results for the query are fetched from the index directly (that is, from memory).
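A minimal sketch of both, assuming the index is still named moduleName as in the question's CREATE TABLE:
-- approach #1: force the existing single-column index
SELECT `moduleName`, `menuName`
FROM `Modules` FORCE INDEX (`moduleName`)
WHERE `moduleName` = LEFT('abc_def', LENGTH(`moduleName`));
-- approach #2: turn it into a covering composite index
ALTER TABLE `Modules` DROP INDEX `moduleName`,
                      ADD INDEX `moduleName` (`moduleName`, `menuName`);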
Approach #2 (see SQLfiddle) will get you an index hit with a simple query, and should offer much better performance on a larger table. On small tables, I (i.e., lserni - see comment) don't think it's worth the effort.

You are effectively doing a regex on the field, so no key is going to work. However, in your example, you could make it a bit more efficient: each moduleName that matches must sort less than or equal to 'abc_def', so you can add:
and moduleName <= 'abc_def'
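In full, the combined query would look like this; the added range predicate gives the optimizer an upper bound it can use on the moduleName index:
SELECT `moduleName`, `menuName`
FROM `Modules`
WHERE `moduleName` <= 'abc_def'
  AND 'abc_def' LIKE CONCAT(`moduleName`, '%');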
The only other alternative I can think of is:
where moduleName in ('a','ab','abc','abc_','abc_d','abc_de','abc_def')
Not pretty.

Try adding an index hint to your query (the index names below are placeholders; substitute your own):
SELECT `moduleName` ,`menuName`
FROM `Modules` USE INDEX (col1_index,col2_index)
WHERE 'abc_def' LIKE(CONCAT(`moduleName`,'%'))

Since your database engine is InnoDB:
All user data by default in InnoDB is stored in pages comprising a B-tree index
B-trees are good for the following lookups:
● Exact full value (= xxx)
● Range of values (BETWEEN xx AND yy)
● Column prefix (LIKE 'xx%')
● Leftmost prefix
So for your query, rather than forcing an index, let's think about how to speed it up.
You can speed up the query by creating a covering index.
A covering index refers to the case when all fields selected in a query are covered by an index; in that case InnoDB (not MyISAM) will never read the data in the table, but will only use the data in the index, significantly speeding up the select.
Note that in InnoDB the primary key is included in all secondary indexes, so in a way all secondary indexes are compound indexes.
This means that if you run the following query on InnoDB:
SELECT `moduleName` ,`menuName`
FROM `Modules1`
WHERE 'abc_def' LIKE(CONCAT(`moduleName`,'%'))
MySQL will always use a covering index and will not access the actual table.
To verify this, look at the EXPLAIN output.
What do the EXPLAIN columns mean?
table: Indicates which table the output row refers to.
type: Shows us which type of join is being used. From best to worst
the types are: system, const, eq_ref, ref, range, index, all
possible_keys: Indicates which indices MySQL can choose from to find the rows in this table
key: Indicates the key (index) that MySQL actually decided to use. If MySQL decides to use one of the possible_keys indexes to look up rows, that index is listed as the key value.
key_len: It's the length of the key used. The shorter the better.
ref: Which column (or constant) is used
rows: The number of rows MySQL believes it must examine to execute the query.
Extra: Extra info; the bad ones to see here are "using temporary" and "using filesort"
I had 1,990 rows.
My Experiments:
I would recommend lserni's solution for the WHERE clause.
case 1) no indexes
explain select `moduleName` ,`menuName` FROM `Modules1` WHERE moduleName = SUBSTRING('abc_def', 1, LENGTH(moduleName));
+----+-------------+----------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | Modules | ALL | NULL | NULL | NULL | NULL | 2156 | Using where |
+----+-------------+----------+------+---------------+------+---------+------+------+-------------+
1 row in set (0.00 sec)
Ways of creating covering indexes
case 2) ALTER TABLE `test`.`Modules1` ADD index `mod_name` (`moduleName`)
explain select `moduleName` ,`menuName` FROM `Modules1` WHERE moduleName = SUBSTRING('abc_def', 1, LENGTH(moduleName));
+----+-------------+----------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | Modules | ALL | NULL | NULL | NULL | NULL | 2156 | Using where |
+----+-------------+----------+------+---------------+------+---------+------+------+-------------+
Here the plan is still unchanged: the single-column index is not chosen (see the columns key and Extra). The covering indexes in the next cases change that.
case 3) ALTER TABLE `test`.`Modules1` DROP INDEX `mod_name` ,
ADD INDEX `mod_name` ( `moduleName` , `menuName` )
explain select `moduleName` ,`menuName` FROM `Modules1` WHERE moduleName = SUBSTRING('abc_def', 1, LENGTH(moduleName));
+----+-------------+----------+-------+---------------+----------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+-------+---------------+----------+---------+------+------+--------------------------+
| 1 | SIMPLE | Modules | index | NULL | mod_name | 1069 | NULL | 2066 | Using where; Using index |
+----+-------------+----------+-------+---------------+----------+---------+------+------+--------------------------+
1 row in set (0.00 sec)
case 4) ALTER TABLE `test`.`Modules1` DROP INDEX `mod_name` ,
ADD INDEX `mod_name` ( `ID` , `moduleName` , `menuName` )
explain select `moduleName` ,`menuName` FROM `Modules1` WHERE moduleName = SUBSTRING('abc_def', 1, LENGTH(moduleName));
+----+-------------+----------+-------+---------------+----------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+-------+---------------+----------+---------+------+------+--------------------------+
| 1 | SIMPLE | Modules | index | NULL | mod_name | 1073 | NULL | 2061 | Using where; Using index |
+----+-------------+----------+-------+---------------+----------+---------+------+------+--------------------------+
1 row in set (0.00 sec)
edit:
use WHERE moduleName REGEXP '^(a|ab|abc|abc_|abc_d|abc_de|abc_def)$'
in place of the SUBSTRING() comparison.
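In full, a sketch with the alternation built from the prefixes of 'abc_def':
SELECT `moduleName`, `menuName`
FROM `Modules1`
WHERE `moduleName` REGEXP '^(a|ab|abc|abc_|abc_d|abc_de|abc_def)$';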

DECLARE @SEARCHING_TEXT AS VARCHAR(500)
SET @SEARCHING_TEXT = 'ab'
SELECT [moduleName], [menuName] FROM [MODULES]
WHERE FREETEXT (MODULENAME, @SEARCHING_TEXT);
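Note that FREETEXT and this DECLARE syntax are SQL Server (T-SQL), not MySQL. A rough MySQL counterpart would be full-text search; a sketch, assuming a FULLTEXT index is possible (MyISAM, or InnoDB from MySQL 5.6) and keeping in mind that short tokens like 'ab' may be filtered out by ft_min_word_len:
ALTER TABLE `Modules` ADD FULLTEXT INDEX `ft_module` (`moduleName`);
SELECT `moduleName`, `menuName`
FROM `Modules`
WHERE MATCH (`moduleName`) AGAINST ('ab');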

Similar to the suggestion by fthiella, but more flexible (as it can cope with longer strings easily):-
SELECT DISTINCT `moduleName` ,`menuName`
FROM `Modules`
CROSS JOIN (SELECT a.i + b.i * 10 + c.i * 100 + 1 AS anInt FROM integers a, integers b, integers c) Sub1
WHERE LEFT('abc_def', Sub1.anInt) = `moduleName`
This (as typed) copes with strings up to 1000 characters long but is slower than fthiella's solution. It can easily be cut down for strings up to 100 characters, at which point it seems marginally quicker than fthiella's solution.
Putting a check for length in it does speed it up a bit:-
SELECT SQL_NO_CACHE DISTINCT `moduleName` ,`menuName`
FROM `Modules`
INNER JOIN (SELECT a.i + b.i * 10 + c.i * 100 + 1 AS anInt FROM integers a, integers b, integers c ) Sub1
ON Sub1.anInt <= LENGTH('abc_def') AND Sub1.anInt <= LENGTH(`moduleName`)
WHERE LEFT('abc_def', Sub1.anInt) = `moduleName`
Or with a slight amendment to bring the possible substrings back from the subselect:-
SELECT SQL_NO_CACHE DISTINCT `moduleName` ,`menuName`
FROM `Modules`
CROSS JOIN (SELECT DISTINCT LEFT('abc_def', a.i + b.i * 10 + c.i * 100 + 1) AS aStart FROM integers a, integers b, integers c WHERE( a.i + b.i * 10 + c.i * 100 + 1) <= LENGTH('abc_def')) Sub1
WHERE aStart = `moduleName`
Note that these solutions depend on a table of integers with a single column and rows with the values 0 to 9.
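For completeness, a minimal sketch of that helper table:
CREATE TABLE integers (i TINYINT UNSIGNED NOT NULL PRIMARY KEY);
INSERT INTO integers (i) VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);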

My answer may be a bit more complex:
alter table Modules add column name_index int
alter table Modules add index name_integer_index(name_index);
When you insert into the Modules table, you calculate an integer value for moduleName, something like SELECT ASCII('a').
Then, when running your query, you just need to run:
SELECT `moduleName`, `menuName`
FROM `Modules`
WHERE name_index >
(select ascii('a')) and name_index < (select ascii('abc_def'))
It will use name_integer_index.
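A sketch of the precomputation at insert time (note that ASCII() only returns the code of the first character, so this buckets rows only by their leading character):
INSERT INTO Modules (moduleName, menuName, name_index)
VALUES ('abc', 'name2', ASCII('abc'));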

I am not sure this is really a nice query, but it makes use of the index:
SELECT `moduleName` ,`menuName`
FROM `Modules` WHERE LEFT('abc_def', 7) = `moduleName`
UNION ALL
SELECT `moduleName` ,`menuName`
FROM `Modules` WHERE LEFT('abc_def', 6) = `moduleName`
UNION ALL
SELECT `moduleName` ,`menuName`
FROM `Modules` WHERE LEFT('abc_def', 5) = `moduleName`
UNION ALL
SELECT `moduleName` ,`menuName`
FROM `Modules` WHERE LEFT('abc_def', 4) = `moduleName`
UNION ALL
SELECT `moduleName` ,`menuName`
FROM `Modules` WHERE LEFT('abc_def', 3) = `moduleName`
UNION ALL
SELECT `moduleName` ,`menuName`
FROM `Modules` WHERE LEFT('abc_def', 2) = `moduleName`
UNION ALL
SELECT `moduleName` ,`menuName`
FROM `Modules` WHERE LEFT('abc_def', 1) = `moduleName`
General solution
And this is a general solution, using a dynamic query:
SET @search='abc_def';
SELECT
CONCAT(
'SELECT `moduleName` ,`menuName` FROM `Modules` WHERE ',
GROUP_CONCAT(
CONCAT(
'moduleName=\'',
LEFT(@search, ln),
'\'') SEPARATOR ' OR ')
)
FROM
(SELECT DISTINCT LENGTH(moduleName) ln
FROM Modules
WHERE LENGTH(moduleName)<=LENGTH(@search)) s
INTO @sql;
This will create a string with a SQL query that has a condition WHERE moduleName='abc' OR moduleName='abc_' OR ... and it should be able to create the string quickly because of the index (if not, it can be improved a lot using a temporary indexed table with numbers from 1 to the maximum allowed length of your string; an example is given in the fiddle). Then you can just execute the query:
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
Please see fiddle here.

LIKE queries of this form do not use indexes. Alternatively, you can define a full-text index for searching strings like this; note, however, that the InnoDB engine did not support full-text indexes before MySQL 5.6, only MyISAM did.

Add an index key to moduleName.
Check http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
(B-Tree Index Characteristics) for more info.
I'm not sure why you are using LIKE; it's usually best to avoid it. My suggestion would be to fetch all the rows, save them as JSON, and then perform the search client-side via AJAX.

(previous part of answer deleted - see newtover's answer which is the same, but better, for that).
newtover's approach #2 (see SQLfiddle) will get you an index hit with a simple query, and should offer better performance on longer tables:
SELECT `moduleName`, `menuName`
FROM `Modules`
WHERE moduleName = LEFT('abc_def', LENGTH(moduleName));
If you need data from a lot of columns (instead of only menuName), i.e. if Modules is wider as well as longer, you might be better served by moving moduleName into a lookup table containing only an ID, the moduleName, and its length (to save one function call).
The actual extra space needed is small, and if moduleName has a low cardinality, i.e. you have few moduleNames repeated across lots of menuNames, you might actually end up saving considerable space.
The new schema will be:
Modules table
moduleName_id integer, foreign key to Lookup.id
...all the fields in Modules except moduleName...
Lookup table
id primary key
moduleName varchar
moduleLength integer
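In DDL form, a sketch (column types are illustrative):
CREATE TABLE `Lookup` (
  `id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  `moduleName` VARCHAR(100) NOT NULL,
  `moduleLength` INT UNSIGNED NOT NULL,
  KEY (`moduleName`)
) ENGINE=InnoDB;
ALTER TABLE `Modules` ADD COLUMN `moduleName_id` INT UNSIGNED NOT NULL;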
and the query:
SELECT `Lookup`.`moduleName`,`menuName`
FROM `Modules` INNER JOIN `Lookup`
ON (`Modules`.`moduleName_id` = Lookup.id)
WHERE `Lookup`.`moduleName` = LEFT('abc_def',
`Lookup`.`moduleLength`);
This SQLfiddle starts from your schema and modifies it to achieve the above. Speed and storage space improvements strongly depend on what data you put in the tables. I intentionally put myself in the best conditions (many short fields in Modules, an average of one hundred menuNames for each moduleName) and was able to save around 30% of storage space; the search performance was only around 3x better, and probably biased by I/O caching, so unless someone runs more thorough testing, I'd leave it at "appreciable space and time savings are possible".
On the other hand, on small, simple tables with an equal number of menus and modules (i.e. 1:1), there will be a slight storage penalty for no appreciable speed gain. In that situation, however, the space and time involved will be very small, so the more "normalized" form above might still be the way to go, despite the added complexity.

We can achieve this with a single function call, instead of the two used in SUBSTRING('abc_def', 1, LENGTH(moduleName)). LOCATE returns the 1-based position of moduleName within the string, so comparing it with 1 keeps the prefix-match semantics:
WHERE LOCATE(moduleName, 'abc_def') = 1;

Related

Very slow query when using `id in (max(id))` in subquery

We recently moved our database from MariaDB to AWS Amazon Aurora RDS (MySQL). We observed something strange in a set of queries. We have two queries that are very quick, but when combined as a nested subquery they take ages to finish.
Here id is the primary key of the table
SELECT * FROM users where id in(SELECT max(id) FROM users where id = 1);
execution time is ~350ms
SELECT * FROM users where id in(SELECT id FROM users where id = 1);
execution time is ~130ms
SELECT max(id) FROM users where id = 1;
execution time is ~130ms
SELECT id FROM users where id = 1;
execution time is ~130ms
We believe it has something to do with the type of value returned by MAX, which causes the index to be ignored when the outer query runs over the results of the subquery.
All the above queries are simplified for illustration of the problem. The original queries have more clauses as well as 100s of millions of rows. The issue did not exist prior to the migration and worked fine in MariaDB.
--- RESULTS FROM MariaDB ---
MySQL seems to optimize less efficiently than MariaDB (in this case).
When doing this in MySQL (see: DBFIDDLE1), the execution plans look like:
For the query without MAX:
+----+-------------+----------+------------+-------+---------------+---------+---------+-------+------+----------+-------------+
| id | select_type | table    | partitions | type  | possible_keys | key     | key_len | ref   | rows | filtered | Extra       |
+----+-------------+----------+------------+-------+---------------+---------+---------+-------+------+----------+-------------+
|  1 | SIMPLE      | integers | null       | const | PRIMARY       | PRIMARY | 4       | const |    1 |   100.00 | Using index |
|  1 | SIMPLE      | integers | null       | const | PRIMARY       | PRIMARY | 4       | const |    1 |   100.00 | Using index |
+----+-------------+----------+------------+-------+---------------+---------+---------+-------+------+----------+-------------+
For the query with MAX:
+----+--------------------+----------+------------+-------+---------------+---------+---------+------+------+----------+------------------------------+
| id | select_type        | table    | partitions | type  | possible_keys | key     | key_len | ref  | rows | filtered | Extra                        |
+----+--------------------+----------+------------+-------+---------------+---------+---------+------+------+----------+------------------------------+
|  1 | PRIMARY            | integers | null       | index | null          | PRIMARY | 4       | null | 1000 |   100.00 | Using where; Using index     |
|  2 | DEPENDENT SUBQUERY | null     | null       | null  | null          | null    | null    | null | null |     null | Select tables optimized away |
+----+--------------------+----------+------------+-------+---------------+---------+---------+------+------+----------+------------------------------+
While MariaDB (see: DBFIDDLE2) does have a better-looking plan when using MAX:
+----+--------------+-------------+--------+---------------+---------+---------+-------+------+----------+------------------------------+
| id | select_type  | table       | type   | possible_keys | key     | key_len | ref   | rows | filtered | Extra                        |
+----+--------------+-------------+--------+---------------+---------+---------+-------+------+----------+------------------------------+
|  1 | PRIMARY      | <subquery2> | system | null          | null    | null    | null  |    1 |   100.00 |                              |
|  1 | PRIMARY      | integers    | const  | PRIMARY       | PRIMARY | 4       | const |    1 |   100.00 | Using index                  |
|  2 | MATERIALIZED | null        | null   | null          | null    | null    | null  | null |     null | Select tables optimized away |
+----+--------------+-------------+--------+---------------+---------+---------+-------+------+----------+------------------------------+
EDIT: For lack of time earlier 😉, I now add some info.
A suggestion to fix this:
SELECT *
FROM integers
WHERE i IN (select * from (SELECT MAX(i) FROM integers WHERE i=1)x);
When looking at the execution plan from MariaDB, which has one extra step, I tried to do the same in MySQL. The above query has an even bigger execution plan, but tests show that it performs better (for the explain plans, see: DBFIDDLE1a).
"the question is Mariadb that much faster? it uses a step more that mysql"
One step more does not mean that things get slower.
MySQL takes about 2-3 seconds on the query using MAX, while MariaDB executes the same in under 10 ms. But this is performance, and times may vary on different systems.
SELECT max(id) FROM users where id = 1
This is strange. Since it is looking only at rows where id = 1, the "max" is obviously 1. So is the min. And the average.
Perhaps you wanted:
SELECT max(id) FROM users
Is there an index on id? Perhaps the PRIMARY KEY? If not, then that might explain the sluggishness.
This can be done much faster (again assuming an index):
SELECT * FROM users
ORDER BY id DESC
LIMIT 1
Does that give you what you want?
To discuss this further, please provide SHOW CREATE TABLE users

MySQL Merge Index Optimization not working

I was trying to simulate the index merge optimization in MySQL as described at http://dev.mysql.com/doc/refman/5.0/en/index-merge-optimization.html but I have no idea why I don't get type = 'index_merge'.
Here is the table definition:
create table hotel(
index1 int not null,
nume varchar(100),
index2 int
);
CREATE UNIQUE INDEX hotel_index1 ON hotel (index1);
CREATE UNIQUE INDEX hotel_index2 ON hotel (index2);
insert into hotel(index1,nume,index2) values (1,'primu',1),
(2,'al2lea',2),(5,'al3lea',4),(4,'al4lea',5);
and I run the select as shown on the site:
explain extended select * from hotel where index1 = 5 or index2 = 4;
and the result row from explain is:
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE hotel const hotel_index1,hotel_index2 (null) (null) (null) 4 100 Using where
What am I doing wrong with the indexes, so that they do not merge like in theory?
SQL optimizers are really bad at OR optimizations. An alternative is to use UNION or UNION ALL. The exact equivalent is:
select h.*
from hotel h
where h.index1 = 5
union
select h.*
from hotel h
where index2 = 4;
This should use the indexes correctly (assuming the tables are large enough to take advantage of an index).
Note: this uses union. If you don't need the elimination of duplicates, then use union all.
EDIT:
The documentation has this insightful comment:
The choice between different possible variants of the Index Merge
access method and other access methods is based on cost estimates of
various available options.
Apparently, the cost estimates don't lead the optimizer to the optimal choice.
The answer for why I don't see the index merge optimization is that it applies when I use LIMIT: then MySQL will split the OR into separate index lookups, try to find the best index for each, do the searches, and in the end union all the results.
explain extended select * from hotel where index1 = 5 OR index2 = 4 limit 3;
I get the expected plan:
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE hotel index_merge hotel_index1,hotel_index2 hotel_index1,hotel_index2 4,5 (null) 2 100 Using union(hotel_index1,hotel_index2); Using where

Optimizing the performance of MySQL regarding aggregation

I'm trying to optimize a report query; like most report queries, this one incorporates aggregation. Since the size of the table is considerable and growing, I need to attend to its performance.
For example, I have a table with three columns: id, name, action. And I would like to count the number of actions each name has done:
SELECT name, COUNT(id) AS count
FROM tbl
GROUP BY name;
As simple as it gets, I can't run it in an acceptable time. It might take 30 seconds, and there is no index whatsoever that I can add which is taken into account, let alone improves it.
When I run EXPLAIN on the above query, it never uses any of the table's indexes, i.e. the index on name.
Is there any way to improve the performance of the aggregation? Why is the index not used?
[UPDATE]
Here's the EXPLAIN's output:
+----+-------------+-------+------+---------------+------+---------+------+---------+----------+-----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------+---------------+------+---------+------+---------+----------+-----------------+
| 1 | SIMPLE | tbl | ALL | NULL | NULL | NULL | NULL | 4025567 | 100.00 | Using temporary |
+----+-------------+-------+------+---------------+------+---------+------+---------+----------+-----------------+
And here is the table's schema:
CREATE TABLE `tbl` (
`id` bigint(20) unsigned NOT NULL DEFAULT '0',
`name` varchar(1000) NOT NULL,
`action` int unsigned NOT NULL,
PRIMARY KEY (`id`),
KEY `inx` (`name`(255))
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
The problem with your query and use of indexes is that you refer to two different columns in your SELECT statement yet only have one column in your indexes, plus the use of a prefix on the index.
Try this (refer to just the name column):
SELECT name, COUNT(*) AS count
FROM tbl
GROUP BY name;
With the following index (no prefix):
tbl (name)
Don't use a prefix on the index for this query because if you do, MySQL won't be able to use it as a covering index (will still have to hit the table).
If you use the above, MySQL will scan through the index on the name column, but won't have to scan the actual table data. You should see USING INDEX in the explain result.
This is as fast as MySQL will be able to accomplish such a task. The alternative is to store the aggregate result separately and keep it updated as your data changes.
Also, consider reducing the size of the name column, especially if you're hitting index size limits, which you most likely are, hence the prefix. Save some room by not using UTF8 if you don't need it (UTF8 uses 3 bytes per character in the index).
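A sketch of that suggestion (assuming the names fit in 255 characters, which keeps a full-column utf8 index within InnoDB's 767-byte prefix limit):
ALTER TABLE `tbl`
  MODIFY `name` VARCHAR(255) NOT NULL,
  DROP INDEX `inx`,
  ADD INDEX `inx` (`name`);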
It's a very common question, and the key to the solution lies in the fact that your table is growing.
So the first way would be to create an index on the name column, if it isn't created yet. But this will only solve your issue for a time.
A more proper approach would be to create a separate statistics table like:
tbl_counts
+------+-------+
| name | count |
+------+-------+
And store your counts separately. When changing (insert/update or delete) your data in the tbl table, you'll need to adjust the corresponding row inside the tbl_counts table. This approach lets you avoid the COUNT query entirely, but you will need to add some logic around changes to tbl.
To maintain the integrity of your statistics table you can either use triggers (a sketch follows) or do it inside the application. This method is good if the performance of the COUNT query is much more important to you than your data-changing queries (and the overhead of updating the tbl_counts table won't be too high).
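A minimal trigger sketch for the insert case (assuming name is the primary key of tbl_counts; DELETE and UPDATE need analogous triggers):
DELIMITER $$
CREATE TRIGGER ai_tbl AFTER INSERT ON tbl FOR EACH ROW
BEGIN
  -- first occurrence of a name creates the counter row, later ones bump it
  INSERT INTO tbl_counts (name, count) VALUES (NEW.name, 1)
  ON DUPLICATE KEY UPDATE count = count + 1;
END $$
DELIMITER ;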

MySQL query takes too long -- what should be the index?

Here is my query:
CREATE TEMPORARY TABLE temptbl (
pibn INT UNSIGNED NOT NULL, page SMALLINT UNSIGNED NOT NULL)
ENGINE=MEMORY;
INSERT INTO temptbl (
SELECT pibn,page FROM mytable
WHERE word1=429907 AND word2=0);
ALTER TABLE temptbl ADD INDEX (pibn,page);
SELECT word1,COUNT(*) AS aaa
FROM mytable a
INNER JOIN temptbl b
ON a.pibn=b.pibn AND a.page=b.page
WHERE word2=0
GROUP BY word1 ORDER BY aaa DESC LIMIT 10;
DROP TABLE temptbl;
The issue is the SELECT word1,COUNT(*) AS aaa, specifically the count. That select statement takes 16 seconds.
EXPLAIN says:
+----+-------------+-------+------+---------------------------------+-------------+---------+-------------------------------------------------------------+-------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------------------------+-------------+---------+-------------------------------------------------------------+-------+---------------------------------+
| 1 | SIMPLE | b | ALL | pibn | NULL | NULL | NULL | 26778 | Using temporary; Using filesort |
| 1 | SIMPLE | a | ref | w2pibnpage1,word21pibn,pibnpage | w2pibnpage1 | 9 | const,db.b.pibn,db.b.page | 4 | Using index |
+----+-------------+-------+------+---------------------------------+-------------+---------+-------------------------------------------------------------+-------+---------------------------------+
The index used (w2pibnpage1) is on:
word2,pibn,page,word1,id
I've been struggling with this for days, trying different combinations of columns for the index (which is annoying as it takes an hour to rebuild - millions of rows).
What should my indexes be, or what should I do to get this query to run in a fraction of a second (as it should)?
Here is a suggestion.
Presumably the temporary table is small. You can remove the index on that table, because a full table scan is fine there. In fact, that is what you want.
You then want indexes used on the big table. First the indexes need to match the join condition, then to match the where condition, and finally the group by condition. So, the suggestion is:
mytable(pibn, page, word2, word1)
word1 is thrown in at the end so the count can be computed without fetching values from the original data (the ORDER BY column aaa is an aggregate, so it cannot be part of an index).
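In DDL, that suggestion would look like this (the index name is illustrative):
ALTER TABLE mytable ADD INDEX idx_join_cover (pibn, page, word2, word1);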
The query is taking a long time, but the expensive part seems to be accessing mytable (you've not provided its structure). However, the optimizer seems to think it only needs to fetch 4 rows from it using an index, which should be very fast; i.e. the data appears to be very skewed. How many rows does the last query examine (tally of counts)?
Without having a look at the exact distribution of data, it's hard to be definitive - certainly you may need to hint the query to get it to work efficiently. The problem with designing indexes is that they should make all the queries faster - or at least give a reasonable tradeoff.
Looking at the predicates in the queries you've provided...
WHERE word1=429907 AND word2=0
Would be best served by an index on word1,word2,.... or word2,word1,.....
ON a.pibn=b.pibn AND a.page=b.page
WHERE a.word2=0
Would be best served by an index on mytable with word2+pibn+page in the leading columns.
How many distinct values are there for mytable.word1 and for mytable.word2? If word2 has a low number of distinct values (less than 20 or so) then it's not adding much selectivity to the index and can be omitted.
An index on word2,pibn,page,word1 gives you a covering index for the second query.
If your temptbl is small you want to first restrict the bigger table (mytable) and then join it (eventually by index) to your temptbl.
Right now, MySQL thinks it is better off by using the index of the bigger table to join.
You can get around this by doing a straight join:
SELECT word1,COUNT(*) AS aaa
FROM mytable a
STRAIGHT_JOIN temptbl b
ON a.pibn=b.pibn AND a.page=b.page
WHERE word2=0
GROUP BY word1
ORDER BY aaa DESC LIMIT 10;
This should use your index in mytable for the where clause and join mytable to temptbl via the index in temptbl.
If MySQL still wants to do it different, you can use FORCE INDEX to make it use the index.
With your data volumes it is not going to work fast no matter what you do, not without changing the schema.
If I understand you right, you're looking for the top words which go along with 429907 on the same pages.
Your model as it is now would require counting all those words over and over again each time you run the query.
To speed it up, you would need to create an additional stats table:
CREATE TABLE word_pairs
(
word1_1 INT NOT NULL,
word1_2 INT NOT NULL,
cnt BIGINT NOT NULL,
PRIMARY KEY (word1_1, word1_2),
INDEX (word1_1, cnt),
INDEX (word1_2, cnt)
)
and update it each time you insert a record into the large table (increase cnt for the newly inserted word paired with every word that appears on the same page).
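The per-insert maintenance could look like this (a sketch; the word IDs are made up, and in practice you would issue one such statement per word pair on the page):
INSERT INTO word_pairs (word1_1, word1_2, cnt)
VALUES (429907, 12345, 1)
ON DUPLICATE KEY UPDATE cnt = cnt + 1;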
This would probably be too slow for a single server, as such updates take some time, so you might also need to shard that table across multiple servers.
If you had such a table you could just run:
SELECT *
FROM word_pairs
WHERE word1_1 = 429907
ORDER BY
cnt DESC
LIMIT 10
which would be instant.
I came up with this:
CREATE TEMPORARY TABLE temp1 (
pibn INT UNSIGNED NOT NULL, page SMALLINT UNSIGNED NOT NULL)
ENGINE=MEMORY;
INSERT INTO temp1 (
SELECT pibn,page FROM mytable
WHERE word1=429907 AND word2=0);
CREATE TEMPORARY TABLE temp2 (
word1 MEDIUMINT UNSIGNED NOT NULL)
ENGINE=MEMORY;
INSERT INTO temp2 (
SELECT a.word1
FROM mytable a, temp1 b
WHERE a.word2=0 AND a.pibn=b.pibn AND a.page=b.page);
DROP TABLE temp1;
CREATE INDEX index1 ON temp2 (word1) USING BTREE;
CREATE TEMPORARY TABLE temp3 (
word1 MEDIUMINT UNSIGNED NOT NULL, num INT UNSIGNED NOT NULL)
ENGINE=MEMORY;
INSERT INTO temp3 (SELECT word1,COUNT(*) AS aaa FROM temp2 USE INDEX (index1) GROUP BY word1);
DROP TABLE temp2;
CREATE INDEX index1 ON temp3 (num) USING BTREE;
SELECT word1,num FROM temp3 USE INDEX (index1) ORDER BY num DESC LIMIT 10;
DROP TABLE temp3;
Takes 5 seconds.

MySQL 5.5 "select distinct" is really slow

One of the things my app does a fair amount is:
select count(distinct id) from x;
with id the primary key for table x. With MySQL 5.1 (and 5.0), it looks like this:
mysql> explain SELECT count(distinct id) from x;
+----+-------------+----------+-------+---------------+-----------------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+-------+---------------+-----------------+---------+------+---------+-------------+
| 1 | SIMPLE | x | index | NULL | ix_blahblahblah | 1 | NULL | 1234567 | Using index |
+----+-------------+----------+-------+---------------+-----------------+---------+------+---------+-------------+
On InnoDB, this isn't exactly blazing, but it's not bad, either.
This week I'm trying out MySQL 5.5.11, and was surprised to see that the same query is many times slower. With the cache primed, it takes around 90 seconds, compared to 5 seconds before. The plan now looks like this:
mysql> explain select count(distinct id) from x;
+----+-------------+----------+-------+---------------+---------+---------+------+---------+-------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+-------+---------------+---------+---------+------+---------+-------------------------------------+
| 1 | SIMPLE | x | range | NULL | PRIMARY | 4 | NULL | 1234567 | Using index for group-by (scanning) |
+----+-------------+----------+-------+---------------+---------+---------+------+---------+-------------------------------------+
One way to make it go fast again is to use select count(id) from x, which is safe because id is a primary key, but I'm going through some abstraction layers (like NHibernate) that make this a non-trivial task.
I tried analyze table x but it didn't make any appreciable difference.
It looks kind of like this bug, though it's not clear what versions that applies to, or what's happening (nobody's touched it in a year yet it's "serious/high/high").
Is there any way, besides simply changing my query, to get MySQL to be smarter about this?
UPDATE:
As requested, here's a way to reproduce it, more or less. I wrote this SQL script to generate 1 million rows of dummy data (takes 10 or 15 minutes to run):
delimiter $$
drop table if exists x;
create table x (
id integer unsigned not null auto_increment,
a integer,
b varchar(100),
c decimal(9,2),
primary key (id),
index ix_a (a),
index ix_b (b),
index ix_c (c)
) engine=innodb;
drop procedure if exists fill;
create procedure fill()
begin
declare i int default 0;
while i < 1000000 do
insert into x (a,b,c) values (1,"one",1.0);
set i = i+1;
end while;
end$$
delimiter ;
call fill();
When it's done, I observe this behavior:
5.1.48
select count(distinct id) from x
EXPLAIN is: key: ix_a, Extra: Using index
takes under 1.0 sec to run
select count(id) from x
EXPLAIN is: key: ix_a, Extra: Using index
takes under 0.5 sec to run
5.5.11
select count(distinct id) from x
EXPLAIN is: key: PRIMARY, Extra: Using index for group-by
takes over 7.0 sec to run
select count(id) from x
EXPLAIN is: key: ix_a, Extra: Using index
takes under 0.5 sec to run
EDIT:
If I modify the query in 5.5 by saying
select count(distinct id) from x force index (ix_a);
it runs much faster. Indexes b and c also work (to varying degrees), and even forcing index PRIMARY helps.
I'm not making any promises that this will be better but, as a possible workaround, you could try:
SELECT COUNT(*)
FROM (SELECT id
FROM x
GROUP BY id) t
I'm not sure why you need DISTINCT on a unique primary key. It looks like MySQL is viewing the DISTINCT keyword as an operator and losing the ability to make use of the index (as would any operation on a field.) Other SQL engines also sometimes don't optimize searches on expressions very well, so it's not a surprise.
I note your comment in another answer about this being an artifact of your ORM. Have you ever read the famous Leaky Abstractions blog by Joel Spolsky? I think you are there. Sometimes you end up spending more time straightening out the tool than you spend on the problem you're using the tool to solve.
I don't know if you have realised, but counting the rows of a large InnoDB table is slow even without the DISTINCT keyword. InnoDB does not cache the row count in the table metadata; MyISAM does.
I would suggest you do one of two things
1) create a trigger that inserts/updates distinct counts into another table on insertion.
2) slave another MySQL server to your database, but change the table type on the slave only to MyISAM and perform your query there (this is probably overkill).
I may be misreading your question, but if id is the primary key of table x, then the following two queries are logically equivalent:
select count(distinct id) from x;
select count(*) from x;
...regardless of whether the optimizer realizes this. Distinct generally implies a sort or scanning the index in order, which is considerably slower than just counting the rows.
Creative use of autoincrement fields
Note that your id is autoincrement.
It will add +1 after each insert.
However it does not reuse numbers, so if you delete a row you need to keep track of that.
My idea goes something like this.
Count(rows) = Max(id) - number of deletions - starting(id) + 1
Scenario using update
Create a separate table with the totals per table.
table counts
id integer autoincrement primary key
tablename varchar(45) /*not needed if you only need to count 1 table*/
start_id integer default maxint
delete_count integer default 0
Make sure you record the starting id into the table before the first delete(!):
INSERT INTO counts (tablename, start_id, delete_count)
SELECT 'x', MIN(x.id), 0
FROM x;
Now create an after-delete trigger.
DELIMITER $$
CREATE TRIGGER ad_x_each AFTER DELETE ON x FOR EACH ROW
BEGIN
UPDATE counts SET delete_count = delete_count + 1 WHERE tablename = 'x';
END $$
DELIMITER ;
If you want the count, you do:
SELECT max(x.id) - c.start_id + 1 - c.delete_count as number_of_rows
FROM x
INNER JOIN counts c ON (c.tablename = 'x')
This will give you your count instantly, without requiring a trigger to count on every insert.
insert scenario
If you have lots of deletes, you can speed up the process by doing an insert instead of an update in the trigger, and selecting:
TABLE count_x /*1 counting table per table to keep track of*/
id integer autoincrement primary key /*make sure this field starts at 1*/
start_id integer default maxint /*do not put an index on this field!*/
Seed the starting id into the count table.
INSERT INTO count_x (start_id) SELECT MIN(x.id) FROM x;
Now create an after-delete trigger.
DELIMITER $$
CREATE TRIGGER ad_x_each AFTER DELETE ON x FOR EACH ROW
BEGIN
INSERT INTO count_x (start_id) VALUES (default);
END $$
DELIMITER ;
SELECT max(x.id) - min(c.start_id) + 1 - max(c.id) AS number_of_rows
FROM x
JOIN count_x as c ON (c.id > 0)
You'll have to test which approach works best for you.
Note that in the insert scenario you don't need delete_count, because you are using the autoincrementing id to keep track of the number of deletions.
select count(*)
from (select distinct id from x) t