Can spatial indexes be used for joining tables in mysql?
I have two tables with a GEOMETRY column and a spatial index on it. The table is MyISAM.
I would like to join these tables on the rows which intersect.
I cannot get mysql to use the spatial indexes despite using FORCE INDEX in my query.
The query is:
SELECT *
FROM a FORCE INDEX FOR JOIN (asidx)
JOIN b FORCE INDEX FOR JOIN (bsidx)
ON Intersects(a.g, b.g) -- g is the name of the GEOMETRY
The explain plan is:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | a | ALL | asidx | NULL | NULL | NULL | 50000 | |
| 1 | SIMPLE | b | ALL | bsidx | NULL | NULL | NULL | 50000 | Using where; Using join buffer |
Why aren't the indexes use?
For 50k rows tables it runs for 15 minutes.
How can I make it faster?
Yes, spatial indexes can be used for joins, but you aren't using an index for your join.
In order for a index to be used for a search, MySQL needs a constant, or an expression that refers to a data column that has already been read.
You're not referencing an indexed column in your ON clause. You're referencing 0 or 1, which is the result of the INTERSECTS() function.
Your ON clause specifies a function on columns from both tables in the join. MySQL won't have the column values required for the function until both records are already read, so no index can be used for the join, requiring a full scan.
Basically, MySQL doesn't know if two records go together until it tries it, so it must try every combination.
Perhaps a better solution would be to precalculate a table of intersection pairs, store them in a separate (junction) table, and join that way.
Related
What is the performance penalty for SELECT * FROM Table VS SELECT * FROM (SELECT * FROM Table AS A) AS B
My questions are: Firstly, does the SELECT * involve iteration over the rows in the table, or will it simply return all rows as a chunk without any iteration (because no WHERE clause was given), and if so does the nested query in example two involve iterating over the table twice, and will take 2x the time of the first query? thanks...
The answer to this question hinges on whether you are using mysql before 5.7, or 5.7 and after. I may be altering your question slightly, but hopefully the following captures what you are after.
Your SELECT * FROM Table does a table scan via the clustered index (the physical ordering). In the case of no primary key, one is implicitly available to the engine. There is no where clause as you say. No filtering or choice of another index is attempted.
The Explain output (see also) shows 1 row in its summary. It is relatively straight forward. The explain output and performance with your derived table B will differ depending on whether you are on a version before 5.7, or 5.7 and after.
The document Derived Tables in MySQL 5.7 describes it well for versions 5.6 and 5.7, where the latter will provide no penalty due to the change in materialized derived table output being incorporated into the outer query. In prior versions, substantial overhead was endured with temporary tables with the derived.
It is quite easy to test the performance penalty prior to 5.7. All it takes is a medium sized table to see the noticeable impact that your question's derived table has on impacting performance. The following example is on a small table in version 5.6:
explain
select qm1.title
from questions_mysql qm1
join questions_mysql qm2
on qm2.qid<qm1.qid
where qm1.qid>3333 and qm1.status='O';
+----+-------------+-------+-------+-----------------+---------+---------+------+-------+------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+-----------------+---------+---------+------+-------+------------------------------------------------+
| 1 | SIMPLE | qm1 | range | PRIMARY,cactus1 | PRIMARY | 4 | NULL | 5441 | Using where |
| 1 | SIMPLE | qm2 | ALL | PRIMARY,cactus1 | NULL | NULL | NULL | 10882 | Range checked for each record (index map: 0x3) |
+----+-------------+-------+-------+-----------------+---------+---------+------+-------+------------------------------------------------+
explain
select b.title from
( select qid,title from questions_mysql where qid>3333 and status='O'
) b
join questions_mysql qm2
on qm2.qid<b.qid;
+----+-------------+-----------------+-------+-----------------+---------+---------+------+-------+----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------------+-------+-----------------+---------+---------+------+-------+----------------------------------------------------+
| 1 | PRIMARY | qm2 | index | PRIMARY,cactus1 | cactus1 | 10 | NULL | 10882 | Using index |
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 5441 | Using where; Using join buffer (Block Nested Loop) |
| 2 | DERIVED | questions_mysql | range | PRIMARY,cactus1 | PRIMARY | 4 | NULL | 5441 | Using where |
+----+-------------+-----------------+-------+-----------------+---------+---------+------+-------+----------------------------------------------------+
Note, I did change the question, but it illustrates the impact that derived tables and their lack of index use with the optimizer has in versions prior to 5.7. The derived table benefits from indexes as it is being materialized. But thereafter it endures overhead as a temporary table and is incorporated into the outer query without index use. This is not the case in version 5.7
I have a query:
select SQL_NO_CACHE id from users
where id>1 and id <1000
and id in ( select owner_id from comments and content_type='Some_string');
(note that it is short of an actual large query used for my sphinx indexing, representing the problem)
This query is taking about 3.5 seconds(modifying range from id = 1..5000 makes it about 15 secs).
users table has about 35000 entries and comments table has about 8000 entries.
Explain on above query:
explain select SQL_NO_CACHE id from users
where id>1 and id <1000
and id in ( select distinct owner_id from d360_core_comments);
| id | select_type | table | type | possible_keys
| key | key_len | ref | rows | Extra |
| 1 | PRIMARY | users | range | PRIMARY
| PRIMARY | 4 | NULL | 1992 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | d360_core_comments | ALL | NULL | NULL | NULL | NULL | 6901 | Using where; Using temporary |
where the individual subquery(select owner_id from d360_core_comments where content_type='Community20::Topic';) here is taking almost 0.0 seconds.
However if I add index on owner_id,content_type, (note the order here)
create index tmp_user on d360_core_comments (owner_id,content_type);
My subquery runs as is in ~0.0 seconds with NO index used:
mysql> explain select owner_id from d360_core_comments where
content_type='Community20::Topic';
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | d360_core_comments | ALL | NULL | NULL
| NULL | NULL | 6901 | Using where |
However now my main query (select SQL_NO_CACHE id from users where id>1 and id <1000 and id in ( select owner_id from d360_core_comments where content_type='Community20::Topic');)
now runs in ~0 seconds with following explain:
mysql> explain select SQL_NO_CACHE id from users where id>1 and id
<1000 and id in ( select owner_id from d360_core_comments where
content_type='Community20::Topic');
| id | select_type | table | type | possible_keys | key |
key_len | ref | rows | Extra |
| 1 | PRIMARY
| users | range | PRIMARY | PRIMARY | 4 | NULL | 1992 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY |
d360_core_comments | index_subquery | tmp_user | tmp_user | 5 | func | 34 | Using where |
So the main questions I have are:
If the index defined on the table used in my subquery is not getting used in my actual subquery then how it is optimizing the query here?
And why in the first place the first query was taking so much time when the actual subquery and main query independently are much faster?
What seems to happen in full query without the index is that MySQL will build (some sort of) temporary table of all the owner_id that the subquery generates. Then for each row from the users table that matches the id constraint, a lookup in this temporary construct will be performed. It is unclear if the overhead is creating the temporary construct, or if the lookup is implemented suboptimally (so that all elements are linearly matched for each row from the outer query.
When you create the index on owner_id, this doesn't change anything when you run only the subquery, because it has no condition on owner_id, nor does the index cover the content_type column.
However, when you run the full query with the index, there is more information available, since we now have values coming from the outer query that should be matched to owner_id, which is covered by the index. So the execution now seems to be to run the first part of the outer query, and for each matching row do an index lookup by owner_id. In other words, a possible execution plan is:
From Index-Users-Id Get all id matching id>1 and id <1000
For Each Row
Include Row If Index-Comment-OwnerId Contains row.Id
And Row Matches content_type='Some_string'
So in this case, the work to run 1000 (I assume) index lookups is faster than building a temporary construct of the 8000 possible owner_id. But this is only a hypothesis, since I don't know MySQL very well.
If you read this section of the MySQL Reference Manual: Optimizing Subqueries with EXISTS Strategy, you'll see that the query optimizer transforms your subquery condition from:
id in ( select distinct owner_id
from d360_core_comments
where content_type='Community20::Topic')
into:
exists ( select 1
from d360_core_comments
where content_type='Community20::Topic'
and owner_id = users.id )
This is why a index on (owner_id, content_type) is not useful when the subquery is tested as standalone query, but it is useful when considering the transformed subquery.
The first thing you should know is that MySQL can not optimize dependent subqueries, it is a for a long time well-known MySQL deficiency, that is going to be fixed in MySQL 6.x (just google for "mysql dependent subquery" and you will see). That is the subquery is basically executed for each matching row in users table. Since you have an additional condition, the overall execution time depends on that condition. The solution is to substitute the subquery with a join (the very optimization that you expect from MySQL under the hood).
Second, there is a syntax error in your subquery, and I think there was a condition on owner_id. Thus, when you add an index on owner_id it is used, but is not enough for the second condition (hence no using index), but why is not mentioned in EXPLAIN at all is a question (I think because of the condition on the users.id)
Third, I do not know why you need that id > 1 and id < 5000 condition, but you should understand that these are two range conditions that require very accurate, sometimes non-obvious and data-dependent indexing approach (as opposed to equality comparison conditions), and if you actually do not need them and use only to undestand why the query takes so long, then it was a bad idea and they would shed no light.
In case, the conditions are required and the index on owner_id is still there, I would rewrite the query as follows:
SELECT id
FROM (
SELECT owner_id as id
FROM comments
WHERE owner_id < 5000 AND content_type = 'some_string'
) as ids
JOIN users ON (id)
WHERE id > 1;
P.S. A composite index on (content_type, owner_id) will even be better for the query.
Step 1: Use id BETWEEN x AND y instead of id >= x AND id <= y. You may find some surprising gains because it indexes better.
Step 2: Adjust your sub-SELECT to do the filtering so it doesn't have to be done twice:
SELECT SQL_NO_CACHE id
FROM users
WHERE id IN (SELECT owner_id
FROM comments
WHERE content_type='Some_string'
AND owner_id BETWEEN 1 AND 1000);
There seems to be several errors in your statement. You're selecting 2 through 999 for instance, presumably off by one on both ends, and the subselect wasn't valid.
I'm trying to troubleshoot a performance issue on MySQL, so I wanted to create a smaller version of a table to work with. When I add a LIMIT clause to the query, it goes from about 2 seconds (for the full insert) to astronomical (42 minutes).
mysql> select pr.player_id, max(pr.insert_date) as insert_date from player_record pr
inner join date_curr dc on pr.player_id = dc.player_id where pr.insert_date < '2012-05-15'
group by pr.player_id;
+------------+-------------+
| 1002395119 | 2012-05-14 |
...
| 1002395157 | 2012-05-14 |
| 1002395187 | 2012-05-14 |
| 1002395475 | 2012-05-14 |
+------------+-------------+
105776 rows in set (2.19 sec)
mysql> select pr.player_id, max(pr.insert_date) as insert_date from player_record pr
inner join date_curr dc on pr.player_id = dc.player_id where pr.insert_date < '2012-05-15'
group by pr.player_id limit 1;
+------------+-------------+
| player_id | insert_date |
+------------+-------------+
| 1000000080 | 2012-05-14 |
+------------+-------------+
1 row in set (42 min 23.26 sec)
mysql> describe player_record;
+------------------------+------------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------------+------------------------+------+-----+---------+-------+
| player_id | int(10) unsigned | NO | PRI | NULL | |
| insert_date | date | NO | PRI | NULL | |
| xp | int(10) unsigned | YES | | NULL | |
+------------------------+------------------------+------+-----+---------+-------+
17 rows in set (0.01 sec) (most columns removed)
There are 20 million rows in the player_record table, so I am creating two tables in memory for the specific dates I am looking to compare.
CREATE temporary TABLE date_curr
(
player_id INT UNSIGNED NOT NULL,
insert_date DATE,
PRIMARY KEY player_id (player_id, insert_date)
) ENGINE=MEMORY;
INSERT into date_curr
SELECT player_id,
MAX(insert_date) AS insert_date
FROM player_record
WHERE insert_date BETWEEN '2012-05-15' AND '2012-05-15' + INTERVAL 6 DAY
GROUP BY player_id;
CREATE TEMPORARY TABLE date_prev LIKE date_curr;
INSERT into date_prev
SELECT pr.player_id,
MAX(pr.insert_date) AS insert_date
FROM player_record pr
INNER join date_curr dc
ON pr.player_id = dc.player_id
WHERE pr.insert_date < '2012-05-15'
GROUP BY pr.player_id limit 0,20000;
date_curr has 216k entries, and date_prev has 105k entries if I don't use a limit.
These tables are just part of the process, used to trim down another table (500 million rows) to something manageable. date_curr includes the player_id and insert_date from the current week, and date_prev has the player_id and most recent insert_date from BEFORE the current week for any player_id present in date_curr.
Here is the explain output:
mysql> explain SELECT pr.player_id,
MAX(pr.insert_date) AS insert_date
FROM player_record pr
INNER JOIN date_curr dc
ON pr.player_id = dc.player_id
WHERE pr.insert_date < '2012-05-15'
GROUP BY pr.player_id
LIMIT 0,20000;
+----+-------------+-------+-------+---------------------+-------------+---------+------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------------+-------------+---------+------+--------+----------------------------------------------+
| 1 | SIMPLE | pr | range | PRIMARY,insert_date | insert_date | 3 | NULL | 396828 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | dc | ALL | PRIMARY | NULL | NULL | NULL | 216825 | Using where; Using join buffer |
+----+-------------+-------+-------+---------------------+-------------+---------+------+--------+----------------------------------------------+
2 rows in set (0.03 sec)
This is on a system with 24G RAM dedicated to the database, and currently is pretty much idle. This specific database is the test so it is completely static. I did a mysql restart and it still has the same behavior.
Here is the 'show profile all' output, with most time being spent on copying to tmp table.
| Status | Duration | CPU_user | CPU_system | Context_voluntary | Context_involuntary | Block_ops_in | Block_ops_out | Messages_sent | Messages_received | Page_faults_major | Page_faults_minor | Swaps | Source_function | Source_file | Source_line |
| Copying to tmp table | 999.999999 | 999.999999 | 0.383941 | 110240 | 18983 | 16160 | 448 | 0 | 0 | 0 | 43 | 0 | exec | sql_select.cc | 1976 |
A bit of a long answer but I hope you can learn something from this.
So based on the evidence in the explain statement you can see that there was two possible indexes that the MySQL query optimizer could have used they are as follows:
possible_keys
PRIMARY,insert_date
However the MySQL query optimizer decided to use the following index:
key
insert_date
This is a rare occasion where MySQL query optimizer used the wrong index. Now there is a probable cause for this. You are working on a static development database. You probably restored this from production to do development against.
When the MySQL optimizer needs to make a decision on which index to use in a query it looks at the statistics around all the possible indexes. You can read more about statistics here http://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-other-changes-statistics-estimation.html for a starter.
So when you update, insert and delete from a table you change the index statistics. It might be that the MySQL server because of the static data had the wrong statistics and chose the wrong index. This however is just a guess at this point as a possible root cause.
Now lets dive into the indexes. There was two possible indexes to use the primary key index and the index on insert_date. MySQL used the insert_date one. Remember during a query execution MySQL can only use one index always. Lets look at the difference between the primary key index and the insert_date index.
Simple fact about a primary key index(aka clustered):
A primary key index is normally a btree structure that contains the data rows i.e. it is the table as it contains the date.
Simple fact about secondary index(aka non-clustered):
A secondary index is normally a btree structure that contains the data being indexed(the columns in the index) and a pointer to the location of the data row on the primary key index.
This is a subtle but big difference.
Let me explain when you read a primary key index you are reading the table. The table is in order of the primary index as well. Thus to find a value I would search the index read the data which is 1 operation.
When you read a secondary index you search the index find the pointer then read the primary key index to find the data based on the pointer. This is essentially 2 operations making the operation of reading a secondary index twice as costly as reading the primary key index.
In your case since it chose the insert_date as the index to use it was doing double the work just to do the join. That is problem one.
Now when you LIMIT a recordset it is the last piece of execution of the query. MySQL has to take the entire recordset sort it (if not sorted allready) based on ORDER BY and GROUP BY conditions then take the number of records you want and send it back based on the LIMIT BY section. MySQL has to do a lot of work to keep track of records to send and where it is in the record set etc. LIMIT BY does have a performance hit but I suspect there might be a contributing factor read on.
Look at your GROUP BY it is by player_id. The index that is used is insert_date. GROUP BY essentially orders your record set, however since it had no index to use for ordering (remember a index is sorted in the order of the column(s) contained in it). Essentially you were asking sort/order on player_id and the index used was sorted on insert_date.
This step caused the filesort problem which essentially takes the data that is returned from reading the secondary index and primary index(remember the 2 operations) and then has to sort them. Sorting is normally done on disk as it is a very very expensive operation to do in memory. Thus the entire query result was written to disk and sorted painfully slow to get you your results.
By removing the insert_date index MySQL will now use the primary key index which means the data is ordered(ORDER BY/GROUP BY) player_id and insert_date. This will eliminate the need to read the secondary index and then use the pointer to read the primary key index i.e. the table, and since the data is already sorted MySQL has very little work when applying the GROUP BY piece of the query.
Now the following is a bit of a educated guess again if you could post the results of the explain statement after the index was dropped I would probably be able to confirm my thinking. So by using the wrong index the results were sorted on disk to apply the LIMIT BY properly. Removing the LIMIT BY allows MySQL to probably sort in Memory as it does not have to apply the LIMIT BY and keep track of what is being returned. The LIMIT BY probably caused the temporary table to be created. Once again difficult to say without seeing the difference between the statements i.e. output of explain.
Hopefully this gives you a better understanding of indexes and why they are a double edged sword.
Had the same problem. When I added FORCE INDEX (id) it went back to the few milliseconds of a query it was without the limit, while producing the same results.
I have a table containing 60 million rows. The structure is like entryid, date, sourceid, detail, views. (entryid, date, sourceid, detail) is the PK, and I also have indexes for each field except views.
The problem is the cardinalities of the four indexes are zero, but I am sure they should not.
I wonder why is that? And does it mean the index doesn't work?
It's possible that the table statistics have not been updated.
See this page on optimizing MyISAM tables:
To help MySQL better optimize queries, use ANALYZE TABLE or run
myisamchk --analyze on a table after it has been loaded with data.
This updates a value for each index part that indicates the average
number of rows that have the same value. (For unique indexes, this is
always 1.) MySQL uses this to decide which index to choose when you
join two tables based on a nonconstant expression. You can check the
result from the table analysis by using SHOW INDEX FROM tbl_name and
examining the Cardinality value. myisamchk --description --verbose
shows index distribution information.
The best way to determine whether an index is helping is to explain a query:
mysql> explain select 1;
+----+-------------+-------+------+---------------+------+---------+------+------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+------+----------------+
| 1 | SIMPLE | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
+----+-------------+-------+------+---------------+------+---------+------+------+----------------+
1 row in set (0.00 sec)
Table structure:
+-------------+----------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+----------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| total | int(11) | YES | | NULL | |
| thedatetime | datetime | YES | MUL | NULL | |
+-------------+----------+------+-----+---------+----------------+
Total rows: 137967
mysql> explain select * from out where thedatetime <= NOW();
+----+-------------+-------------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | out | ALL | thedatetime | NULL | NULL | NULL | 137967 | Using where |
+----+-------------+-------------+------+---------------+------+---------+------+--------+-------------+
The real query is much more longer with more table joins, the point is, I can't get the table to use the datetime index. This is going to be hard for me if I want to select all data until certain date. However, I noticed that I can get MySQL to use the index if I select a smaller subset of data.
mysql> explain select * from out where thedatetime <= '2008-01-01';
+----+-------------+-------------+-------+---------------+-------------+---------+------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+-------+---------------+-------------+---------+------+-------+-------------+
| 1 | SIMPLE | out | range | thedatetime | thedatetime | 9 | NULL | 15826 | Using where |
+----+-------------+-------------+-------+---------------+-------------+---------+------+-------+-------------+
mysql> select count(*) from out where thedatetime <= '2008-01-01';
+----------+
| count(*) |
+----------+
| 15990 |
+----------+
So, what can I do to make sure MySQL will use the index no matter what date that I put?
There are two things in play here -
Index is not selective enough - if the index covers more than approx. 30% of the rows, MySQL will decide a full table scan is more efficient. When you contract the range the index kicks in.
One index per table in a join
The real query is much more longer
with more table joins, the point is ...
The point is exactly because it has joins that it probably can't use that index. MySQL can use one index per table in a join (unless it qualifies for an index-merge optimization). If the primary key is already used for the join, thedatetime won't be used. In order to use it, you need to create a multi-column index on the join key + thedatetime index, in the correct order.
Check the EXPLAIN of the actual query to see which key MySQL uses for the join. Modify that index to include the thedatetime column as well, or create a new multi-column index from both (depending on what you use the join key for).
Everything works as it is supposed to. :)
Indexes are there to speed up retrieval. They do it using index lookups.
In you first query the index is not used because you are retrieving ALL rows, and in this case using index is slower (lookup index, get row, lookup index, get row... x number of rows is slower then get all rows == table scan)
In the second query you are retrieving only a portion of the data and in this case table scan is much slower.
The job of the optimizer is to use statistics that RDBMS keeps on the index to determine the best plan. In first case index was considered, but planner (correctly) threw it away.
EDIT
You might want to read something like this to get some concepts and keywords regarding mysql query planner.