Edit: the time spent querying a normal word is actually 1.78 seconds. The 4.5 seconds mentioned in the original post below was measured when querying special words like '.vnet'. (I know REGEXP '\\b.vnet\\b' won't find a whole-word match for '.vnet'. I might use a more complex regex to fix this later, or drop support for '.vnet' if it's too time-consuming.) I have also added solution 5 below.
I have the following MySQL query to achieve whole word matching.
SELECT source, target
FROM tm
WHERE source REGEXP '\\bword\\b'
AND customer = 'COMPANY X'
AND language = 'YYY'
ORDER BY CHAR_LENGTH(source)
LIMIT 5;
There are 2 customers and 2 languages currently.
My goal is to find the top 5 closest matches for a phrase among hundreds of thousands of English sentences. The fetched records are ordered by CHAR_LENGTH because the shorter the source, the higher the match ratio, given that REGEXP '\\bword\\b' already ensures source contains word.
The tm table:
CREATE TABLE tm(
id INT AUTO_INCREMENT PRIMARY KEY,
source TEXT(7000) NOT NULL,
target TEXT(6000) NOT NULL,
language CHAR(3),
customer VARCHAR(10),
INDEX src_cus_lang (source(755), customer, language)
);
The query above took about 4.5 seconds to finish, which is very slow for my PC (an Intel Core i5-10400F, 16 GB RAM and an SSD).
The EXPLAIN command showed the below result:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: tm
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 1117154
filtered: 1.00
Extra: Using where; Using filesort
1 row in set, 1 warning (0.00 sec)
I tried deleting the src_cus_lang index and creating a new one, (customer, language, source(755)), but there was no improvement at all.
I can think of a few solutions:
1. Recreate the tm table, ordering by CHAR_LENGTH(source) in the process. This is not ideal for me, as I'd like to keep the original order of the table.
2. Create a new column named src_len, i.e. the length of the source. However, ORDER BY src_len is still very slow.
3. Split the tm table into 4 separate ones by customer and language. Not ideal for me.
4. Index the source column. Still very slow.
5. Use INDEX(customer, language). Took 1.4 seconds longer for both normal words and special words like '.vnet'.
Is there a way to cut the execution time down to less than 0.5 seconds?
This is essentially useless:
INDEX src_cus_lang (source(755), customer, language)
The prefixing keeps the rest of the columns from being very useful. REGEXP requires checking all 1.1M rows.
This would be better:
INDEX(customer, language)
It will at least filter on those two columns, then apply the REGEXP fewer times.
Since the optimizer usually wants to finish with the WHERE before considering the ORDER BY, your attempts with src_len, etc., did not help.
If there are only 4 different combinations of customer and language, not much can be done.
However, you should consider a FULLTEXT(source) index. With such an index,
MATCH(source) AGAINST('+word' IN BOOLEAN MODE)
AND ...
will work much faster.
Also try IN NATURAL LANGUAGE MODE.
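A sketch of that approach (the index name ft_source is my choice; the rest mirrors the original query):

ALTER TABLE tm ADD FULLTEXT INDEX ft_source (source);

SELECT source, target
FROM tm
WHERE MATCH(source) AGAINST('+word' IN BOOLEAN MODE)
  AND customer = 'COMPANY X'
  AND language = 'YYY'
ORDER BY CHAR_LENGTH(source)
LIMIT 5;

Keep in mind that FULLTEXT has its own tokenization rules (minimum token length, stopword lists), so special tokens like '.vnet' may still need separate handling.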
The following query takes 0.7s on a 2.5 GHz dual-core Windows Server 2008 R2 Enterprise machine when run against a 4.5 GB MySQL database. sIndex10 is a varchar(1024) column:
SELECT COUNT(*) FROM e_entity
WHERE meta_oid=336799 AND sIndex10 = ''
An EXPLAIN shows the following information:
id: 1
select_type: SIMPLE
table: e_entity
type: ref
possible_keys: App_Parent,sIndex10
key: App_Parent
key_len: 4
ref: const
rows: 270066
extra: Using Where
There are 230060 rows matching the first condition and 124216 rows matching the clause with the AND operator. meta_oid is indexed, and although sIndex10 is also indexed, I think MySQL is correct not to pick that index, since FORCE INDEX (sIndex10) takes longer.
We have had a look at configuration parameters such as innodb_buffer_pool_size and they look correct as well.
Given that this table already has 642532 records, have we reached the top of the performance MySQL can offer? Is investing in hardware the only way forward at this point?
WHERE meta_oid=336799 AND sIndex10 = ''
begs for a composite index
INDEX(meta_oid, sIndex10) -- in either order
That is not the same as having separate indexes on the columns.
That's all there is to it.
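For example, adding that index (the index name is my choice):

ALTER TABLE e_entity ADD INDEX idx_meta_sindex10 (meta_oid, sIndex10);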
Index Cookbook
One thing I always do is count(id), since id is (nearly) always indexed; counting just the id only has to look at the index.
So try running the query below and see if it performs any better. You should also add SQL_NO_CACHE when testing, to get a better idea of how the query actually performs.
SELECT SQL_NO_CACHE COUNT(id) FROM e_entity
WHERE meta_oid=336799 AND sIndex10 = ''
Note: This is probably not the complete answer for your question, but it was too long for just a comment.
I am trying to learn how to optimize SQL statements, and I was wondering if it's possible to estimate what might be making my queries slow just by looking at the execution plan.
*************************** 1. row ***************************
id: 1
select_type: PRIMARY
table: <derived2>
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 382856
Extra: Using where; Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: PRIMARY
table: rf
type: ref
possible_keys: rec_id
key: rec_id
key_len: 4
ref: rs.id
rows: 7
Extra: Using index condition
*************************** 3. row ***************************
id: 2
select_type: DERIVED
table: f
type: range
possible_keys: facet_name_and_value,rec_id
key: facet_name_and_value
key_len: 309
ref: NULL
rows: 382856
Extra: Using index condition; Using where; Using temporary; Using filesort
*************************** 4. row ***************************
id: 2
select_type: DERIVED
table: r
type: ref
possible_keys: record_id
key: record_id
key_len: 9
ref: sqlse_test_crescentbconflate.f.rec_id
rows: 1
Extra: Using where; Using index
Just by looking at the execution plan I can see that I am using too many joins and that the data is too big, since SQL is using filesort, but I might be wrong.
No, it's not really possible to diagnose the performance issue from just the EXPLAIN output.
But the output does reveal that there's a view query that's returning (an estimated) 384,000 rows. We can't tell whether that's a stored view or an inline view. But we can see that the results from that query are being materialized into a table (MySQL calls it a "derived table"), and then the outer query runs against that. The overhead for that can be considerable.
What we can't tell is whether it's possible to get the same result without the view, i.e. to flatten the query. And if that's not possible, whether there are any predicates on the outer query that could be pushed down into the view.
A "Using filesort" isn't necessarily a bad thing. But that operation can become expensive for really large sets. So we do want to avoid unnecessary sort operations. (What we can't tell from the EXPLAIN output is whether it would be possible to avoid those sort operations.)
And if the query uses a "covering index" then the query is satisfied from the index pages, without needing to lookup/visit pages in the underlying table, which means less work to do.
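As a quick illustration of a covering index (the table and index names here are hypothetical):

-- the index supplies every column the query touches,
-- so EXPLAIN reports "Using index" and no table rows are read:
ALTER TABLE orders ADD INDEX ix_cover (customer_id, status);
SELECT customer_id, status FROM orders WHERE customer_id = 42;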
Also, make sure the predicates are in a form that enables effective use of an index. That means having conditions on bare columns, not wrapping the columns in functions. e.g.
We want to avoid writing a condition like this:
where DATE_FORMAT(t.dt,'%Y-%m') = '2016-01'
when the same thing can be expressed like this:
where t.dt >= '2016-01-01' and t.dt < '2016-02-01'
With the former, MySQL has to evaluate the DATE_FORMAT function for every row in the table, and then compare the return value. With the latter form, MySQL can use a "range scan" operation on an index with dt as the leading column. A range scan operation has the potential to eliminate vast swaths of rows very efficiently, without actually needing to examine the rows.
To summarize, the biggest performance improvements would likely come from
avoiding creating a derived table (no view definitions)
pushing predicates into view definitions (where view definitions can't be avoided)
avoiding unnecessary sort operations
avoiding unnecessary joins
writing predicates in a form that can make use of suitable indexes
creating suitable indexes, covering indexes where appropriate
I'd look at the extra field in the execution plan, and then examine your query and your database schema to find ways to improve performance.
using temporary means a temporary table was used, which may slow down the query. Furthermore, temporary tables may end up being written to the disk (and not stored in RAM, which the server typically tries to do if it can) if they are too large.
According to the MySQL 5.5 documentation, here are some reasons temporary tables are created:
- Evaluation of UNION statements.
- Evaluation of some views, such as those that use the TEMPTABLE algorithm, UNION, or aggregation.
- Evaluation of statements that contain an ORDER BY clause and a different GROUP BY clause, or for which the ORDER BY or GROUP BY contains columns from tables other than the first table in the join queue.
- Evaluation of DISTINCT combined with ORDER BY may require a temporary table.
- For queries that use the SQL_SMALL_RESULT option, MySQL uses an in-memory temporary table, unless the query also contains elements (described later) that require on-disk storage.
- Evaluation of multiple-table UPDATE statements.
- Evaluation of GROUP_CONCAT() or COUNT(DISTINCT) expressions.
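For instance, a query of this shape falls under the ORDER BY/GROUP BY rule above, because it sorts on an aggregate rather than on the grouping column (a hypothetical orders table, purely for illustration):

EXPLAIN SELECT customer_id, COUNT(*) AS cnt
FROM orders
GROUP BY customer_id
ORDER BY cnt DESC;
-- typically shows Extra: Using temporary; Using filesort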
Then there's using filesort, which means that a sort was performed which could not be done with existing indexes. This could be no big deal, but you should check what fields are being sorted on and where your indexes are and make sure you're not giving MySQL too much work to do.
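A sort can often be avoided by giving MySQL an index whose leading column matches the ORDER BY, so rows are read already in order. A sketch (the table and index names are hypothetical):

ALTER TABLE t ADD INDEX idx_created (created_at);
-- rows come back in index order, so EXPLAIN no longer shows filesort:
SELECT * FROM t ORDER BY created_at DESC LIMIT 10;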
You may be able to use the execution plan to see why your queries run slowly because you know how your schema works (what columns and indexes you have). But, we here on Stack Overflow can't possibly use just the execution plan to help you.
There's nothing inherently wrong with filesort. It happens to have an unfortunate name; it simply means that satisfying the query requires sorting the results of a subquery. It doesn't necessarily mean the subquery's results have been placed in an actual file in a filesystem.
Try reading this fine tutorial. http://use-the-index-luke.com/
If you need help with a specific query, please ask another question. Include the following information:
The query.
The results of EXPLAIN
The definitions of the tables involved in the query, including indexes.
Pro tip: SELECT * is harmful to performance in big queries with lots of joins. In particular,
SELECT *
FROM gigantic_table
ORDER BY column
LIMIT 1
is an antipattern, because it slurps a huge amount of data, sorts it, and then discards all but one row of the sorted result. Lots of data gets sloshed around in your server for a small result. That's wasteful, even if it is correct. You can do this kind of thing more efficiently with
SELECT *
FROM gigantic_table
WHERE column =
(SELECT MAX(column) FROM gigantic_table)
The best efficiency will come if column is indexed.
I mention this because the first row of your explain makes it look like you're romping through a great many rows to find something.
We have a table with about 25,000,000 rows called 'events' having the following schema:
TABLE events
- campaign_id : int(10)
- city : varchar(60)
- country_code : varchar(2)
The following query takes VERY long (> 2000 seconds):
SELECT COUNT(*) AS counted_events, country_code
FROM events
WHERE campaign_id IN (597)
GROUP BY city, country_code
ORDER BY counted_events
We found out that it's because of the GROUP BY part.
There is already an index idx_campaign_id_city_country_code on (campaign_id, city, country_code) which is used.
Maybe someone can suggest a good solution to speed it up?
Update:
'Explain' shows that out of the many possible indexes MySQL uses this one: 'idx_campaign_id_city_country_code'; for rows it shows '471304' and for 'Extra' it shows 'Using where; Using temporary; Using filesort'.
Here is the whole result of EXPLAIN:
id: '1'
select_type: 'SIMPLE'
table: 'events'
type: 'ref'
possible_keys: 'index_campaign,idx_campaignid_paid,idx_city_country_code,idx_city_country_code_campaign_id,idx_cid,idx_campaign_id_city_country_code'
key: 'idx_campaign_id_city_country_code'
key_len: '4'
ref: 'const'
rows: '471304'
Extra: 'Using where; Using temporary; Using filesort'
UPDATE:
Ok, I think it has been solved:
Looking at the pasted query here again, I realized I forgot to mention that there was one more column in the SELECT called 'country_name'. The query was very slow with country_name included, but if I just leave it out, the performance of the query is absolutely OK.
Sorry for that mistake!
So thank you for all your helpful comments; I'll upvote all the good answers! There were some really helpful additions that I'll probably also apply (like changing types etc.).
Without seeing what EXPLAIN says, this is a long-distance shot; anyway:
- make an index on (city, country_code)
- see if there's a way to use partitioning; your table is getting rather huge
- if country code is always 2 chars, change it to CHAR
- change numeric indexes to unsigned int
- post the entire EXPLAIN output
- don't use IN() - better use:
WHERE campaign_id = 597
OR campaign_id = 231
OR ....
AFAIK IN() is very slow.
Update: as nik0lias commented, IN() is actually faster than concatenating OR conditions.
Some ideas:
Given the nature and size of the table, it would be a great candidate for partitioning by country. This way the events of each country would be stored in a different physical table, even though it behaves as one big virtual table.
Is country code a string? Maybe you have a country_id that would be easier to sort on. (It may force you to create or change indexes.)
Are you really using the city in the group by?
Partitioning, especially by country, will not help.
column IN (const-list) is not slow; it is in fact a case with special optimization.
The problem is, that MySQL doesn't use the index for sorting. I cannot say why, because it should. Could be a bug.
The best strategy to execute this query is to scan the sub-tree of the index where campaign_id=597. Since the index is then sorted by city, country_code, no extra sorting is needed and rows can be counted while scanning.
So the indexes are already optimal for this query. MySQL is just not using them correctly.
I'm getting more information offline. It seems this is not a database problem at all, but:
- the schema is not normalized. The table contains not only country_code, but also country_name (which should be in a separate table).
- the real query contains country_name in the select list. But since that column is not indexed, MySQL cannot use an index-only scan.
As soon as country_name is dropped from the select list, the query reverts to an index-only scan ("using index" in EXPLAIN output) and is blazingly fast.
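A sketch of that normalization (table and column names assumed from the discussion): country_name moves to its own table and is joined in only after the aggregation has been answered from the covering index.

CREATE TABLE countries (
  country_code VARCHAR(2) PRIMARY KEY,
  country_name VARCHAR(60) NOT NULL
);

SELECT t.counted_events, t.country_code, c.country_name
FROM (
  SELECT COUNT(*) AS counted_events, country_code
  FROM events
  WHERE campaign_id = 597
  GROUP BY city, country_code
) AS t
JOIN countries c USING (country_code)
ORDER BY t.counted_events;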
I have a query involving two tables: table A has lots of rows, and contains a field called b_id, which references a record from table B, which has about 30 different rows. Table A has an index on b_id, and table B has an index on the column name.
My query looks something like this:
SELECT COUNT(A.id) FROM A INNER JOIN B ON B.id = A.b_id WHERE (B.name != 'dummy') AND <condition>;
With condition being some random condition on table A (I have lots of those, all exhibiting the same behavior).
This query is extremely slow (taking north of 2 seconds), and EXPLAIN shows that the query optimizer starts with table B, coming up with about 29 rows, and then scans table A. Using STRAIGHT_JOIN turned the order around, and the query ran instantaneously.
I'm not a fan of black magic, so I decided to try something else: look up the id of the record in B that has the name dummy, let's say 23, and then simplify the query to:
SELECT COUNT(A.id) FROM A WHERE (b_id != 23) AND <condition>;
To my surprise, this query was actually slower than the straight join, taking north of a second.
Any ideas on why the join would be faster than the simplified query?
UPDATE: following a request in the comments, the outputs from explain:
Straight join:
+----+-------------+-------+--------+-----------------+---------+---------+---------------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-----------------+---------+---------+---------------+--------+-------------+
| 1 | SIMPLE | A | ALL | b_id | NULL | NULL | NULL | 200707 | Using where |
| 1 | SIMPLE | B | eq_ref | PRIMARY,id_name | PRIMARY | 4 | schema.A.b_id | 1 | Using where |
+----+-------------+-------+--------+-----------------+---------+---------+---------------+--------+-------------+
No join:
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | A | ALL | b_id | NULL | NULL | NULL | 200707 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
UPDATE 2:
Tried another variant:
SELECT COUNT(A.id) FROM A WHERE b_id IN (<all the ids except for 23>) AND <condition>;
This runs faster than the no join, but still slower than the join, so it seems that the inequality operation is responsible for part of the performance hit, but not all.
If you are using MySQL 5.6 or later then you can ask the query optimizer what it is doing;
SET optimizer_trace="enabled=on";
## YOUR QUERY
SELECT COUNT(*) FROM transactions WHERE (id < 9000) and user != 11;
##END YOUR QUERY
SELECT trace FROM information_schema.optimizer_trace;
SET optimizer_trace="enabled=off";
You will almost certainly need to refer to the following sections in the MySQL reference Tracing the Optimiser and The Optimizer
Looking at the first explain, it appears that the query is quicker probably because the optimizer can use table B to filter down to the rows required based on the join, and then use the foreign key to get the rows in table A.
In the explain it's this bit that is interesting: there is only one row matching and it's using schema.A.b_id. Effectively this is pre-filtering the rows from A, which is where I think the performance difference comes from.
| ref | rows | Extra |
| schema.A.b_id | 1 | Using where |
So, as is usual with queries, it all comes down to indexes - or more accurately, missing indexes. Just because you have indexes on individual fields doesn't necessarily mean they are suitable for the query you're running.
Basic rule: if the EXPLAIN doesn't say Using index, then you need to add a suitable index.
Looking at the explain output, the first interesting thing is, ironically, the last thing on each line: the Extra column.
In the first example we see
| 1 | SIMPLE | A | .... Using where |
| 1 | SIMPLE | B | ... Using where |
Both of these say Using where, which is not good; ideally at least one, and preferably both, should say Using index.
When you do
SELECT COUNT(A.id) FROM A WHERE (b_id != 23) AND <condition>;
and see Using where then you need to add an index as it's doing a table scan.
for example if you did
EXPLAIN SELECT COUNT(A.id) FROM A WHERE (Id > 23)
You should see Using where; Using index (assuming here that Id is the primary key and has an index)
If you then added a condition onto the end
EXPLAIN SELECT COUNT(A.id) FROM A WHERE (Id > 23) and Field > 0
and see Using where, then you need to add an index covering the two fields. Just having an index on each field individually doesn't mean that MySQL will be able to use those indexes for a query across multiple fields; this is something the query optimizer decides internally. I'm not exactly certain of the internal rules, but generally adding an extra index to match the query helps immensely.
So adding an index (on the two fields in the query above):
ALTER TABLE `A` ADD INDEX `IndexIdField` (`Id`,`Field`)
should change it such that when querying based upon those two fields there is an index.
I've tried this on one of my databases that has Transactions and User tables.
I'll use this query
EXPLAIN SELECT COUNT(*) FROM transactions WHERE (id < 9000) and user != 11;
Running without index on the two fields:
PRIMARY,user PRIMARY 4 NULL 14334 Using where
Then add an index:
ALTER TABLE `transactions` ADD INDEX `IndexIdUser` (`id`, `user`);
Then the same query again and this time
PRIMARY,user,IndexIdUser IndexIdUser 4 NULL 12628 Using where; Using index
This time it's using the indexes - and as a result will be a lot quicker.
From comments by @Wrikken - and also bear in mind that I don't have the exact schema / data, so some of this investigation has required assumptions about the schema (which may be wrong):
SELECT COUNT(A.id) FROM A FORCE INDEX (b_id)
would perform at least as well as
SELECT COUNT(A.id) FROM A INNER JOIN B ON A.b_id = B.id
If we look at the first EXPLAIN in the OP, we see that there are two elements to the query. Referring to the EXPLAIN documentation for eq_ref, I can see that this is going to define the rows for consideration based on this relationship.
The order of the explain output doesn't necessarily mean it's doing one and then the other; it's simply what has been chosen to execute the query (at least as far as I can tell).
For some reason the query optimizer has decided not to use the index on b_id - I'm assuming here that because of the query the optimizer has decided that it will be more efficient to do a table scan.
The second explain concerns me a little because it's not considering the index on b_id; possibly because of the AND <condition> (which is omitted so I'm guessing as to what it could be). When I try this with an index on b_id it does use the index; but as soon as a condition is added it doesn't use the index.
So, when doing
SELECT COUNT(A.id) FROM A INNER JOIN B ON A.b_id = B.id
all of this indicates to me that the PRIMARY index on B is where the speed difference comes from. I'm assuming, because of the schema.A.b_id in the explain, that there is a foreign key on this table, which must be a better collection of related rows than the index on b_id; so the query optimizer can use this relationship to define which rows to pick, and because a primary index is better than secondary indexes it's going to be much quicker to select rows out of B and then use the relationship link to match against the rows in A.
I do not see any strange behavior here. What you need is to understand the basics of how MySQL uses indexes. Here is an article I usually recommend: 3 ways MySQL uses indexes.
It is always funny to observe people writing things like WHERE (B.name != 'dummy') AND <condition>, because this AND <condition> might be the very reason why the MySQL optimizer chose the specific index, and there is no valid reason to compare the performance of that query with one using WHERE b_id != 23 AND <condition>: the two queries usually need different indexes to perform well.
One thing you should understand, is that MySQL likes equality comparisons, and does not like range conditions and inequality comparisons. It is usually better to specify the correct values than to use a range condition or specify a != value.
So, let's compare the two queries.
With straight join
For each row in A.id order (which is the primary key and is clustered, that is, data is stored on disk in that order), read the row from disk to check whether your <condition> is met and to obtain b_id; then (I repeat, for each matching row) find the appropriate row for that b_id, go to disk, take b.name, and compare it with 'dummy'. Even though this plan is not at all efficient, you have only 200000 rows in your A table, so it seems rather performant.
Without straight join
For each row in table B, check whether name matches; then look into the A.b_id index (which is obviously sorted by b_id, since it is an index, and hence contains A.ids in random order), and for each A.id for the given b_id, fetch the corresponding A row from disk to check the <condition>; if it matches, count the id, otherwise discard the row.
As you see, there is nothing strange in the fact that the second query takes so long: you basically force MySQL to randomly access almost every row in the A table, whereas in the first query you read the A table in the order it is stored on disk.
The query with no join does not use any index at all. It actually should take about the same time as the query with the straight join. My guess is that the order of b_id != 23 and <condition> is significant.
UPD1: Could you still compare the performance of your query without join with the following:
SELECT COUNT(A.id)
FROM A
WHERE IF(b_id!=23, <condition>, 0);
UPD2: the fact that you do not see an index in EXPLAIN does not mean that no index is used at all. An index is at least used to define the reading order: when there is no other useful index, it is usually the primary key; but, as I said above, when there is an equality condition and a corresponding index, MySQL will use that index. So, basically, to understand which index is used, you can look at the order in which rows are output. If the order is the same as the primary key, then no other index was used (that is, the primary key index was used); if the order of rows is shuffled, then some other index was involved.
In your case, the second condition seems to be true for most of the rows, but the index is still used; that is, to get b_id, MySQL goes to disk in random order, which is why it is slow. No black magic here, and this second condition does affect the performance.
Probably this should be a comment rather than an answer but it will be a bit long.
First of all, it is hard to believe that two queries with (almost) exactly the same explain run at different speeds. Furthermore, this is even less likely when the one with the extra line in the explain runs faster. And I guess the word faster is the key here.
You've compared speed (the time it takes for a query to finish), and that is an extremely empirical way of testing. For example, you could have improperly disabled the cache, which would make the comparison useless. Not to mention that your <insert your preferred software application here> could have triggered a page fault or some other operation at the time you ran the test that decreased the query speed.
The right way of measuring query performance is based on the explain (that's why it is there)
So the closest thing I have to an answer to the question Any ideas on why the join would be faster than the simplified query? is, in short, a layer-8 error.
I do have some other comments, though, that should be taken into account in order to speed things up. If A.id is a primary key (the name smells like it is), then, according to your explain, why does count(A.id) have to scan all the rows? It should be able to get the data directly from the index, but I don't see Using index in the extra flags. It seems you don't even have a unique index on it, and that the field is nullable. That also smells odd. Make sure the field is NOT NULL and that there is a unique index on it, run the explain again, confirm the extra flags contain Using index, and then (properly) time the query. It should run much faster.
Also note that an approach that would result in the same performance improvement as I mentioned above would be to replace count(A.id) with count(*).
Just my 2 cents.
Because MySQL will not use an index for column != value in a WHERE clause.
The optimizer decides whether to use an index by estimating. Since a "!=" condition will most likely match nearly everything, it skips the index to reduce overhead. (Yes, MySQL is stupid here: it does not keep statistics on the values in the index column.)
You may get a faster SELECT by using column IN (every value other than val); with that form MySQL will use the index.
Example here showing that the query optimizer will choose not to use an index depending on the value.
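A sketch of the contrast (the values are assumed; in the question's terms, you would list every b_id except 23):

-- negation tends to match most rows, so the optimizer skips the index:
SELECT COUNT(A.id) FROM A WHERE b_id != 23;
-- an explicit value list is index-friendly:
SELECT COUNT(A.id) FROM A WHERE b_id IN (1, 2, 3, 22, 24, 25);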
The answer to this question is actually a very simple consequence of algorithm design:
The key difference between these two queries is the merge operation.
Before I give a lesson on algorithms, I will mention why the merge operation improves the performance: the merge reduces the overall load on the aggregation. This is an iteration vs. recursion issue. In the iteration analogy, we simply loop through the entire index and count the matches. In the recursion analogy, we divide and conquer (so to speak); in other words, we filter the results that we need to count, thus reducing the volume of numbers we actually need to count.
Here are the key questions:
Why is a merge sort faster than an insertion sort?
Is a merge sort always faster than an insertion sort?
Let's explain this with a parable:
Let's say we have a deck of playing cards, and we need to sum the numbers of playing cards that have the numbers 7, 8 and 9 (assuming we don't know the answer in advance).
Let's say that we decide upon two ways to solve this problem:
We can hold the deck in one hand and move the cards to the table, one by one, counting as we go.
We can separate the cards into two groups: black suits and red suits. Then we can perform step 1 upon one of the groups and reuse the results for the second group.
If we choose option 2, then we have divided our problem in half. As a consequence, we can count the matching black cards and multiply the number by 2. In other words, we are re-using the part of the query execution plan that required the counting. This reasoning especially works when we know in advance how the cards were sorted (aka "clustered index"). Counting half of the cards is obviously much less time consuming than counting the entire deck.
If we wanted to improve the performance yet again, depending on how large the size of our database is, we may even further consider sorting into four groups (instead of two groups): clubs, diamonds, hearts, and spades. Whether or not we want to perform this further step depends on whether or not the overhead of sorting the cards into the additional groups is justified by the performance gain. In small numbers of cards, the performance gain is likely not worth the extra overhead required to sort into the different groups. As the number of cards grows, the performance gain begins to outweigh the overhead cost.
Here is an excerpt from "Introduction to Algorithms, 3rd edition," (Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein):
(Keep in mind that "n" is the number of objects we are dealing with.)
"As an example, in Chapter 2, we will see two algorithms for sorting. The first, known as insertion sort, takes time roughly equal to c₁n² to sort n items, where c₁ is a constant that does not depend on n. That is, it takes time roughly proportional to n². The second, merge sort, takes time roughly equal to c₂n lg n, where lg n stands for log₂ n and c₂ is another constant that also does not depend on n. Insertion sort typically has a smaller constant factor than merge sort, so that c₁ < c₂. We shall see that the constant factors can have far less of an impact on the running time than the dependence on the input size n. Let's write insertion sort's running time as c₁n · n and merge sort's running time as c₂n · lg n. Then we see that where insertion sort has a factor of n in its running time, merge sort has a factor of lg n, which is much smaller. (For example, when n = 1000, lg n is approximately 10, and when n equals one million, lg n is approximately only 20.) Although insertion sort usually runs faster than merge sort for small input sizes, once the input size n becomes large enough, merge sort's advantage of lg n vs. n will more than compensate for the difference in constant factors. No matter how much smaller c₁ is than c₂, there will always be a crossover point beyond which merge sort is faster."
Why is this relevant? Let us look at the query execution plans for these two queries. We will see that there is a merge operation caused by the inner join.
I need to do a live search using PHP and jQuery to select cities and countries from the two tables cities (almost 3M rows) and countries (a few hundred rows).
For a short moment I was thinking of using a MyISAM table for cities, as InnoDB does not support FULLTEXT search, but I decided that is not the way to go (frequent table crashes, all the other tables are InnoDB, etc.; and as of MySQL 5.6+, InnoDB also supports FULLTEXT indexes).
So, right now I still use MySQL 5.1. Most cities consist of one word only, or at most 2-3 words, and e.g. for "New York", most people will not search for "York" if they mean "New York". So I just put an index on the city_real column (which is a varchar).
I tried the following query in different versions: without any JOIN and without ORDER BY, with USE INDEX and even with FORCE INDEX, and with LIKE instead of equals (=) (another post said = was faster, and that if the wildcard is only at the end it is OK to use it). In EXPLAIN it always says "using where, using filesort". The average time for the query is about 4 sec, which you have to admit is a little too slow for a live search (a user typing in a text box and seeing suggestions of cities and countries)...
Live search (jQuery ajax) searches if the user typed at least 3 characters...
SELECT ci.id, ci.city_real, co.country_name FROM cities ci LEFT JOIN countries co ON(ci.country_id=co.country_id) WHERE city_real='cit%' ORDER BY population DESC LIMIT 5
There is a PRIMARY on ci.id and an INDEX on ci.city_real. Any ideas why MySQL does not use the index? Or how I could speed up the query? Or where else I should/should not set an INDEX?
Thank you very much in advance for your help!
Here's the explain output
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE ci range city_real city_real 768 NULL 1250 Using where; Using filesort
1 SIMPLE co eq_ref PRIMARY PRIMARY 6 fibsi_1.ci.country_id 1
You should use WHERE city_real LIKE 'cit%', not WHERE city_real='cit%'.
I have tried LIKE instead equal (=) but another post said = was faster and if the wildcard is only at the end, it is OK to use it
This is wrong. = doesn't support wildcards so it will give you the wrong results.
Or how I could speed up the query?
Make sure you have an index on country_id in both tables. Post the output of EXPLAIN SELECT ... if you need further help.
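Putting the two suggestions together (a sketch; the index name is my choice, and population is assumed to live on cities):

ALTER TABLE cities ADD INDEX idx_country_id (country_id);

SELECT ci.id, ci.city_real, co.country_name
FROM cities ci
LEFT JOIN countries co ON ci.country_id = co.country_id
WHERE ci.city_real LIKE 'cit%'
ORDER BY ci.population DESC
LIMIT 5;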
The query does use an index, as seen in the key field of the explain output. The reason it uses filesort is the ORDER BY, and the reason it uses where is probably that one of the fields (city_real, population) allows NULL values.