Why is my spatial index slow? - mysql

I have two tables
CREATE TABLE `city_landmark` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`location` geometry NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `id_UNIQUE` (`id`),
SPATIAL KEY `spatial_index1` (`location`)
) ENGINE=InnoDB AUTO_INCREMENT=10001 DEFAULT CHARSET=latin1
CREATE TABLE `device_locations` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`location` geometry NOT NULL,
PRIMARY KEY (`id`),
SPATIAL KEY `spatial_index_2` (`location`)
) ENGINE=InnoDB AUTO_INCREMENT=1000004 DEFAULT CHARSET=latin1
City landmark rows: 10000
Device locations rows: 1000002
I want to find out how many rows in 'device_locations' are within a certain proximity of each city landmark.
SELECT *,
ifnull(
(
SELECT 1
FROM city_landmark cl force INDEX (spatial_index1)
where st_within(cl.location, st_buffer(dl.location, 1, st_buffer_strategy('point_circle', 6)) ) limit 1), 0) 'in_range'
FROM device_locations dl
LIMIT 200;
This is really slow for some reason. Can you suggest a better method?
It also makes no difference whether spatial_index1 is used or not.
With index: 2.067 seconds
Without index: 2.016 seconds

I'm not familiar with MySQL spatial; I use PostgreSQL with PostGIS. But I will speculate a little.
My guess is that because st_buffer has to be calculated for every row, the query cannot benefit from the index. The same is true of a regular index when the indexed column is wrapped in a function.
So if your city location is a point geometry, add another column, city_perimeter, and fill it with the result of st_buffer. Then you can create a spatial index on city_perimeter.
Your query should become:
SELECT c.id, count(*)
FROM city_landmark c
JOIN device_locations d
ON st_within(d.location, c.city_perimeter)
GROUP BY c.id
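The query above relies on a city_perimeter column that the question's schema does not have. A minimal sketch of how it might be added follows, assuming MySQL 5.7-style geometries with no SRID; the column name comes from this answer, while the index name spatial_perimeter is a hypothetical choice. On MySQL 8.0 the column would probably also need an SRID attribute before the optimizer considers the spatial index.

```sql
-- Add the column as nullable first so existing rows can be back-filled.
ALTER TABLE city_landmark ADD COLUMN city_perimeter GEOMETRY NULL;

-- Precompute the buffer once per landmark, using the same radius and
-- strategy as the question's query.
UPDATE city_landmark
SET city_perimeter = ST_Buffer(location, 1, ST_Buffer_Strategy('point_circle', 6));

-- A SPATIAL index requires a NOT NULL column.
ALTER TABLE city_landmark
  MODIFY city_perimeter GEOMETRY NOT NULL,
  ADD SPATIAL INDEX spatial_perimeter (city_perimeter);
```

With the buffer stored, the per-row st_buffer call disappears from the join condition, which is the point of the suggestion.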

Related

Django slow inner join on a table with more than 10 million records

I am using MySQL with Django. I am trying to count the number of visitor_page rows for a specific dealer within a given time range.
Here is the raw SQL query that I obtained from Django Debug Toolbar.
SELECT COUNT(*) AS `__count`
FROM `visitor_page`
INNER JOIN `dealer_visitors`
ON (`visitor_page`.`dealer_visitor_id` = `dealer_visitors`.`id`)
WHERE (`visitor_page`.`date_time` BETWEEN '2021-02-01 05:51:00'
AND '2021-03-21 05:50:00'
AND `dealer_visitors`.`dealer_id` = 15)
The issue is that I have more than 13 million records in the visitor_page table and about 1.5 million records in the dealer_visitors table. I have already indexed date_time. I am thinking of using a materialized view, but before attempting that, I would really appreciate suggestions on how I could improve this query.
visitor_page schema:
CREATE TABLE `visitor_page` (
`id` int NOT NULL AUTO_INCREMENT,
`date_time` datetime(6) DEFAULT NULL,
`added_at` datetime(6) DEFAULT NULL,
`updated_at` datetime(6) DEFAULT NULL,
`page_id` int NOT NULL,
`dealer_visitor_id` int NOT NULL,
PRIMARY KEY (`id`),
KEY `visitor_page_page_id_246babdf_fk_web_page_id` (`page_id`),
KEY `visitor_page_dealer_visitor_id_e2dddea2_fk_dealer_visitors_id` (`dealer_visitor_id`),
KEY `visitor_page_date_time_06e9e9f5` (`date_time`),
CONSTRAINT `visitor_page_dealer_visitor_id_e2dddea2_fk_dealer_visitors_id` FOREIGN KEY (`dealer_visitor_id`) REFERENCES `dealer_visitors` (`id`),
CONSTRAINT `visitor_page_page_id_246babdf_fk_web_page_id` FOREIGN KEY (`page_id`) REFERENCES `web_page` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=13626649 DEFAULT CHARSET=latin1;
dealer_visitors schema:
CREATE TABLE `dealer_visitors` (
`id` int NOT NULL AUTO_INCREMENT,
`visit_date` datetime(6) DEFAULT NULL,
`added_at` datetime(6) DEFAULT NULL,
`updated_at` datetime(6) DEFAULT NULL,
`dealer_id` int NOT NULL,
`visitor_id` int NOT NULL,
`type` int DEFAULT NULL,
`notes` longtext,
`location` varchar(100) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `dealer_visitors_dealer_id_306e2202_fk_dealer_id` (`dealer_id`),
KEY `dealer_visitors_visitor_id_27ae498e_fk_visitor_id` (`visitor_id`),
KEY `dealer_visitors_type_af0f7d79` (`type`),
KEY `dealer_visitors_visit_date_f2b138c9` (`visit_date`),
CONSTRAINT `dealer_visitors_dealer_id_306e2202_fk_dealer_id` FOREIGN KEY (`dealer_id`) REFERENCES `dealer` (`id`),
CONSTRAINT `dealer_visitors_visitor_id_27ae498e_fk_visitor_id` FOREIGN KEY (`visitor_id`) REFERENCES `visitor` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1524478 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
EXPLAIN ANALYZE and EXPLAIN on the query give me the following (output not preserved):
For this query:
SELECT COUNT(*) AS `__count`
FROM visitor_page vp JOIN
dealer_visitors dv
ON vp.dealer_visitor_id = dv.id
WHERE vp.date_time BETWEEN '2021-02-01 05:51:00' AND '2021-03-21 05:50:00' AND
dv.dealer_id = 15;
The best indexes are on dealer_visitors(dealer_id, id) and visitor_page(dealer_visitor_id, date_time).
An index only on date_time helps a bit. But you are retrieving more than a month's worth of data, and that might be a lot of rows to process. Having dealer_id as the first column of the dealer_visitors index restricts the join to that dealer's visitors, and the visitor_page index then narrows each visitor's pages to the time frame.
Depending on the distribution of the data, the Optimizer might pick one of the tables to start with, or pick the other. So, let's provide optimal indexes for each case:
ON `visitor_page`.`dealer_visitor_id` = `dealer_visitors`.`id`
WHERE `visitor_page`.`date_time` BETWEEN ...
AND `dealer_visitors`.`dealer_id` = 15
Starting with visitor_page:
visitor_page: INDEX(date_time) -- (already exists)
dealer_visitors: (already has PRIMARY KEY(id))
Starting with dealer_visitors:
dealer_visitors: INDEX(dealer_id) -- (already exists)
visitor_page: INDEX(dealer_visitor_id, date_time) -- in this order
and drop visitor_page_dealer_visitor_id_e2dddea2_fk_dealer_visitors_id, which the new index makes redundant.
The net is to add one index and drop one index.
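A sketch of that change, assuming the schema shown in the question. The new index name visitor_page_dv_id_date_time is a hypothetical choice. The single-column index on dealer_visitor_id is a left prefix of the new composite index, which is why it can be considered for removal.

```sql
-- Composite index for the plan that starts with dealer_visitors:
-- equality on dealer_visitor_id first, then the range on date_time.
ALTER TABLE visitor_page
  ADD INDEX visitor_page_dv_id_date_time (dealer_visitor_id, date_time);

-- Optional, and only after the ADD above has succeeded: the old index on
-- (dealer_visitor_id) alone is now a prefix of the new one. MySQL should
-- accept the new index as the backing index for the foreign key.
ALTER TABLE visitor_page
  DROP INDEX visitor_page_dealer_visitor_id_e2dddea2_fk_dealer_visitors_id;

-- Check which plan the optimizer picks now.
EXPLAIN SELECT COUNT(*) AS `__count`
FROM visitor_page vp
JOIN dealer_visitors dv ON vp.dealer_visitor_id = dv.id
WHERE vp.date_time BETWEEN '2021-02-01 05:51:00' AND '2021-03-21 05:50:00'
  AND dv.dealer_id = 15;
```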
Materialized view -- It is often best for Data Warehouse reports to build and incrementally maintain a "summary table" (a "materialized view"). The very odd date range (1 month + 20 days - 60 seconds) makes this clumsy to do. Typically it is handy to make the table based on whole days. If you can shift to daily (or hourly), then see http://mysql.rjweb.org/doc.php/summarytables
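As a rough illustration of the summary-table idea, assuming reports can be shifted to whole days: the table visitor_page_daily and its columns are hypothetical names, not part of the question's schema, and the dates are only examples.

```sql
-- One row per dealer per day (hypothetical table).
CREATE TABLE visitor_page_daily (
  dealer_id  INT NOT NULL,
  visit_day  DATE NOT NULL,
  page_views INT UNSIGNED NOT NULL,
  PRIMARY KEY (dealer_id, visit_day)
) ENGINE=InnoDB;

-- Incremental maintenance: (re)summarize one finished day at a time.
INSERT INTO visitor_page_daily (dealer_id, visit_day, page_views)
SELECT dv.dealer_id, DATE(vp.date_time), COUNT(*)
FROM visitor_page vp
JOIN dealer_visitors dv ON vp.dealer_visitor_id = dv.id
WHERE vp.date_time >= '2021-03-20' AND vp.date_time < '2021-03-21'
GROUP BY dv.dealer_id, DATE(vp.date_time)
ON DUPLICATE KEY UPDATE page_views = VALUES(page_views);

-- The report then reads a few dozen summary rows instead of millions.
SELECT SUM(page_views) AS `__count`
FROM visitor_page_daily
WHERE dealer_id = 15
  AND visit_day BETWEEN '2021-02-01' AND '2021-03-20';
```

The trade-off is that the report can only cut at day boundaries, which is why the odd time-of-day range in the question does not fit well.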
Something else to check: How much RAM do you have? What does SHOW VARIABLES LIKE 'innodb_buffer_pool_size'; say?
I see that the tables have different charset/collation. This is not a problem for the query in question, but if you have other queries that JOIN on VARCHARs, check that they use the same collation.

Relatively simple SQL query with join refuses to be efficient

I'm having some problems optimizing a certain SQL query (using MariaDB). To give you some context: I have a system with "events" (think of them as log entries) that can occur on tickets, but also on some other objects besides tickets (which is why I separated the event and ticket_event tables). I want to get all ticket_events sorted by display_time. The event table has ~20M rows right now.
CREATE TABLE IF NOT EXISTS `event` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`type` varchar(255) DEFAULT NULL,
`data` text,
`display_time` datetime DEFAULT NULL,
`created_time` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_for_display_time_and_id` (`id`,`display_time`),
KEY `index_for_display_time` (`display_time`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE IF NOT EXISTS `ticket_event` (
`id` int(11) NOT NULL,
`ticket_id` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `ticket_id` (`ticket_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
ALTER TABLE `ticket_event`
ADD CONSTRAINT `ticket_event_ibfk_1` FOREIGN KEY (`id`) REFERENCES `event` (`id`),
ADD CONSTRAINT `ticket_event_ibfk_2` FOREIGN KEY (`ticket_id`) REFERENCES `ticket` (`id`);
As you can see, I already played around with some keys (I also made one for (id, ticket_id), which doesn't show up here since I removed it again). The query I execute:
SELECT * FROM ticket_event
INNER JOIN event ON event.id = ticket_event.id
ORDER BY display_time DESC
LIMIT 25
That query takes quite a while to execute (~30s if I filter on a specific ticket_id; I can't even complete it reliably without that filter). If I run EXPLAIN on the query, it shows that it does a filesort and uses a temporary table.
I played around with force index etc. a bit, but that doesn't seem to solve anything or I did it wrong.
Does anyone see what I did wrong or what I can optimize here? I would very much prefer not to make "event" a wide table by adding ticket_id/host_id etc. as columns and just making them NULL if they don't apply.
Thanks in advance!
EDIT: Extra image of EXPLAIN with actual rows in the table:
OK what if you try to force the index?
SELECT * FROM ticket_event
INNER JOIN event
FORCE INDEX (index_for_display_time)
ON event.id = ticket_event.id
ORDER BY display_time DESC
LIMIT 25;
Your query selects every column from every row, even if you use a LIMIT. Have you tried to select one specific row by id?
KEY `index_for_display_time_and_id` (`id`,`display_time`),
is useless; DROP it. It is useless because you are using InnoDB, which stores the data "clustered" on the PK (id).
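A minimal sketch of the drop, using the index name from the question's schema. Because InnoDB secondary indexes already carry the primary key columns, the existing index on (display_time) can still supply id without this extra index.

```sql
-- Remove the index that merely duplicates the clustered primary key order.
ALTER TABLE `event` DROP INDEX `index_for_display_time_and_id`;

-- Confirm that PRIMARY and index_for_display_time remain.
SHOW INDEX FROM `event`;
```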
Please change ticket_event.id to event_id. id is confusing because it feels like the PK of the mapping table, which it is. But wait! That does not make sense? There is only one ticket for each event? Then why does ticket_event exist at all? Why not put ticket_id in event?
For a many-to-many table, do
CREATE TABLE IF NOT EXISTS `ticket_event` (
`event_id` int(11) NOT NULL,
`ticket_id` int(11) NOT NULL,
PRIMARY KEY (`event_id`, `ticket_id`), -- for lookup one direction
KEY (`ticket_id`, `event_id`) -- for the other direction
) ENGINE=InnoDB;
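With a mapping table shaped like the one above, the per-ticket version of the question's query might look like the sketch below. It assumes the column has been renamed to event_id as suggested; the ticket id 42 is a placeholder.

```sql
-- The (ticket_id, event_id) key finds this ticket's events, and the join
-- then fetches each event by primary key before sorting.
SELECT e.*
FROM ticket_event te
INNER JOIN `event` e ON e.id = te.event_id
WHERE te.ticket_id = 42          -- placeholder ticket id
ORDER BY e.display_time DESC
LIMIT 25;
```

Only the events of one ticket have to be sorted here, so the cost depends on how many events that ticket has and not on the 20M-row event table.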
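Maybe you will achieve better performance by trying this (note that it only returns the ticket events found among the 25 most recent events overall, so it can return fewer than 25 rows):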
SELECT *
FROM ticket_event
INNER JOIN (select * from event ORDER BY display_time DESC limit 25) as b
ON b.id = ticket_event.id;

mysql select with order by using filesort no index used

Sorry for the long post, but this is really strange and I am close to giving up. 2 tables:
CREATE TABLE `endu_results` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`base_name` varchar(200) NOT NULL,
`base_nr` int(11) DEFAULT NULL,
`base_yob` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `endu_results_206a6355` (`base_name`),
KEY `endu_results_63df4402` (`base_nr`),
KEY `base_yob` (`base_yob`)
) ENGINE=InnoDB AUTO_INCREMENT=3424028 DEFAULT CHARSET=utf8;
and 2nd:
CREATE TABLE `endu_resultinterest` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`result_id` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `endu_resultinterest_3b529087` (`result_id`),
CONSTRAINT `result_id_refs_id_19e24435` FOREIGN KEY (`result_id`) REFERENCES `endu_results` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=48590 DEFAULT CHARSET=utf8;
There are about 2 million records in the endu_results table and fewer than 100K in endu_resultinterest. I have a slow query:
explain select base_yob from endu_resultinterest
inner join endu_results
on (endu_results.id = endu_resultinterest.result_id)
order by endu_results.base_yob;
1 SIMPLE endu_resultinterest index endu_resultinterest_3b529087 endu_resultinterest_3b529087 4 NULL 47559 Using index; Using temporary; Using filesort
The question is: why is MySQL using the index endu_resultinterest_3b529087 when it should use base_yob, the column the sort is requested on?
To test it further I manually created 2 additional identical tables, endu_testresults and endu_testresultinterest, and filled them with some records:
CREATE TABLE `endu_testresults` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`base_yob` int(11) DEFAULT NULL,
`base_name` varchar(200) NOT NULL,
`base_nr` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `endu_testresults_a65b2616` (`base_yob`),
KEY `endu_testresults_ba0ab39c` (`base_name`),
KEY `endu_testresults_d75ba04d` (`base_nr`)
) ENGINE=InnoDB AUTO_INCREMENT=20 DEFAULT CHARSET=utf8;
So I go again for explain:
explain select base_yob from endu_testresultinterest
inner join endu_testresults
on (endu_testresults.id = endu_testresultinterest.result_id)
order by endu_testresults.base_yob;
and surprise, surprise:
1 SIMPLE endu_testresults index PRIMARY endu_testresults_a65b2616 5 NULL 19 Using index
The index on the sort column base_yob (endu_testresults_a65b2616) is now used.
Why is the index used in one case, while in the other I get 'Using filesort; Using temporary'? Does size matter? I will try to copy records from one table to the other, but I do not understand what is happening with the indexes. MySQL is 5.6.16.
Short answer: Because it is faster.
Long answer...
Your EXPLAINs seem to be incomplete -- I would expect 2 lines in each.
The first table is 20 (70?) times as big as the second. The optimizer picked the smaller table to start with. Hence it is initially doing 1/20th the amount of work. The sort that comes later (ORDER BY ...) is much less work than if it had to do 20 times as much work to start with.
The output is only 48K rows, correct? And that is how many rows in the 2nd table, correct?
Your test tables did not have the same bigger/smaller ratio, did they? Hence the different EXPLAIN.
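One way to check this explanation would be to compare the optimizer's own plan with a forced plan that starts from the big table and reads it in base_yob order. STRAIGHT_JOIN and FORCE INDEX are standard MySQL syntax; whether the forced plan is actually slower on this data is an assumption the thread does not confirm.

```sql
-- Plan chosen by the optimizer: start with the small table, sort afterwards.
EXPLAIN SELECT base_yob
FROM endu_resultinterest
INNER JOIN endu_results
  ON endu_results.id = endu_resultinterest.result_id
ORDER BY endu_results.base_yob;

-- Forced alternative: walk endu_results through the base_yob index, so no
-- sort is needed, at the price of probing from every row of the big table.
EXPLAIN SELECT STRAIGHT_JOIN base_yob
FROM endu_results FORCE INDEX (base_yob)
INNER JOIN endu_resultinterest
  ON endu_results.id = endu_resultinterest.result_id
ORDER BY endu_results.base_yob;
```

Comparing the rows estimates of the two plans, and timing both queries without EXPLAIN, should show which side of the trade-off wins.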

MySQL optimization - large table joins

To start out, here is a simplified version of the tables involved.
tbl_map has approx 4,000,000 rows, tbl_1 has approx 120 rows, and tbl_2 contains approx 5,000,000 rows. I know the data shouldn't be considered that large, given that Google, Yahoo!, etc. use much larger datasets. So I'm just assuming that I'm missing something.
CREATE TABLE `tbl_map` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`tbl_1_id` bigint(20) DEFAULT '-1',
`tbl_2_id` bigint(20) DEFAULT '-1',
`rating` decimal(3,3) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `tbl_1_id` (`tbl_1_id`),
KEY `tbl_2_id` (`tbl_2_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
CREATE TABLE `tbl_1` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
CREATE TABLE `tbl_2` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`data` varchar(255) NOT NULL DEFAULT '',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
The query of interest is below, here with ORDER BY t.id DESC instead of ORDER BY RAND(). It takes as much as 5~10 seconds and causes a considerable wait when users view this page.
EXPLAIN SELECT t.data, t.id , tm.rating
FROM tbl_2 AS t
JOIN tbl_map AS tm
ON t.id = tm.tbl_2_id
WHERE tm.tbl_1_id =94
AND tm.rating IS NOT NULL
ORDER BY t.id DESC
LIMIT 200
1 SIMPLE tm ref tbl_1_id, tbl_2_id tbl_1_id 9 const 703438 Using where; Using temporary; Using filesort
1 SIMPLE t eq_ref PRIMARY PRIMARY 8 tm.tbl_2_id 1
I would just like to speed up the query, ensure that I have proper indexes, etc.
I appreciate any advice from DB Gurus out there! Thanks.
SUGGESTION : Index the table as follows:
ALTER TABLE tbl_map ADD INDEX (tbl_1_id,rating,tbl_2_id);
As per Rolando, yes, you definitely need an index on the map table, but I would make sure it ALSO includes tbl_2_id, which serves your ORDER BY on table 2's ID (the same value is stored in the map table, so just use that column). Since the index then holds all 3 fields, it covers the key search on tbl_1_id, the NULL (or not) criterion on rating, and the 3rd element used by your ORDER BY clause. One caveat: because rating IS NOT NULL is a range condition, tbl_2_id is ordered only within each rating value, so MySQL may still need to sort.
INDEX (tbl_1_id,rating, tbl_2_id);
Then, I would just have the query as
SELECT STRAIGHT_JOIN
t.data,
t.id ,
tm.rating
FROM
tbl_map tm
join tbl_2 t
on tm.tbl_2_id = t.id
WHERE
tm.tbl_1_id = 94
AND tm.rating IS NOT NULL
ORDER BY
tm.tbl_2_id DESC
LIMIT 200

Optimization of a query with GROUP BY clause by using indexes

I need to optimize the indexes on a table that stores more than 10 million rows. The query that is particularly time-consuming takes up to 10 seconds to load (when the WHERE clause filters out only about 2 million rows, so 8 million must be grouped). I have created a few indexes (some of them complex, some simpler) and tried to find out how to speed this up. Perhaps I'm doing something wrong. MySQL is using the optimized_5 index (based on EXPLAIN).
Here is the table's structure and the query:
CREATE TABLE IF NOT EXISTS `geo_reverse` (
`fid` mediumint(8) unsigned NOT NULL,
`tablename` enum('table1','table2') NOT NULL default 'table1',
`geo_continent` varchar(2) NOT NULL,
`geo_country` varchar(2) NOT NULL,
`geo_region` varchar(8) NOT NULL,
`geo_city` mediumint(8) unsigned NOT NULL,
`type` varchar(30) NOT NULL,
PRIMARY KEY (`fid`,`tablename`,`geo_continent`,`geo_country`,`geo_region`,`geo_city`),
KEY `geo_city` (`geo_city`),
KEY `fid` (`fid`),
KEY `geo_region` (`geo_region`,`geo_city`),
KEY `optimized` (`tablename`,`type`,`geo_continent`,`geo_country`,`geo_region`,`geo_city`,`fid`),
KEY `optimized_2` (`fid`,`tablename`),
KEY `optimized_3` (`type`,`geo_city`),
KEY `optimized_4` (`geo_city`,`tablename`),
KEY `optimized_5` (`tablename`,`type`,`geo_city`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
An example query:
SELECT type, COUNT(*) AS objects FROM geo_reverse WHERE tablename = 'table1' AND geo_city IN (5847207,5112771,4916894,...) GROUP BY type
Do you have any idea of how to speed the computation up?
I would use the following index: (geo_city, tablename, type). geo_city is obviously more selective than tablename, so it should be on the left. After the conditions are applied, the remaining column, type, is available in the index for the grouping.
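A sketch of how that index could be added and checked. The index name optimized_6 is a hypothetical choice, and the IN list is shortened to the three ids visible in the question; the real list is longer.

```sql
-- Most selective column first, then the equality column, then the
-- grouping column, so the query can be answered from the index alone.
ALTER TABLE geo_reverse
  ADD INDEX optimized_6 (geo_city, tablename, type);

-- Check whether the optimizer picks the new index and reports "Using index".
EXPLAIN SELECT type, COUNT(*) AS objects
FROM geo_reverse
WHERE tablename = 'table1'
  AND geo_city IN (5847207, 5112771, 4916894)  -- shortened list
GROUP BY type;
```

If the plan still shows optimized_5, comparing both with FORCE INDEX would be a reasonable next step before dropping any of the older indexes.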