MySQL stored functions and groupwise min [closed]

Closed 8 years ago. This question needs to be more focused and is not currently accepting answers.
Schema
The database schema below is simplified.
Events table
This table stores events.
CREATE TABLE `Events` (
`event_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`isPublic` tinyint(1) NOT NULL DEFAULT '1',
PRIMARY KEY (`event_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Places table
Simple table that stores places. One event can be in more than one place.
CREATE TABLE `Places` (
`place_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`latitude` double NOT NULL,
`longitude` double NOT NULL,
PRIMARY KEY (`place_id`),
KEY `latind` (`latitude`,`longitude`)
) ENGINE=InnoDB CHARSET=latin1;
Rules table
Table that stores schedules of events. One event can have more than one schedule. All dates are in unix timestamp format. regular means that the rule has a repeating schedule, which is stored in the RegularRules table.
CREATE TABLE `Rules` (
`rule_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`start_date` int(11) NOT NULL,
`end_date` int(11) NOT NULL,
`regular` tinyint(1) NOT NULL DEFAULT '0',
PRIMARY KEY (`rule_id`),
KEY `endindx` (`end_date`)
) ENGINE=InnoDB CHARSET=latin1;
RegularRules
Table that stores repeatable schedules in the following format: day_start/day_end hold the number of seconds from the beginning of the day (00:00) to the start/end of the event. For example, suppose an event takes place every Monday from 10:00 to 18:00. We store start_date and end_date in the Rules table; these values bound the period during which the event runs. In the RegularRules table we then have 36000 in mon_start and 64800 in mon_end.
CREATE TABLE `RegularRules` (
`repetition_id` bigint(11) unsigned NOT NULL AUTO_INCREMENT,
`rule_id` bigint(20) unsigned NOT NULL,
`mon_start` int(11) DEFAULT NULL,
`tue_start` int(11) DEFAULT NULL,
`wed_start` int(11) DEFAULT NULL,
`th_start` int(11) DEFAULT NULL,
`fr_start` int(11) DEFAULT NULL,
`sat_start` int(11) DEFAULT NULL,
`sun_start` int(11) DEFAULT NULL,
`mon_end` int(11) DEFAULT NULL,
`tue_end` int(11) DEFAULT NULL,
`wed_end` int(11) DEFAULT NULL,
`th_end` int(11) DEFAULT NULL,
`fr_end` int(11) DEFAULT NULL,
`sat_end` int(11) DEFAULT NULL,
`sun_end` int(11) DEFAULT NULL,
PRIMARY KEY (`repetition_id`),
KEY `fk_rule_id_regularrules_idx` (`rule_id`),
CONSTRAINT `fk_rule_id_regularrules` FOREIGN KEY (`rule_id`)
REFERENCES `Rules` (`rule_id`) ON DELETE CASCADE ON UPDATE NO ACTION
) ENGINE=InnoDB CHARSET=latin1;
Events-Places-Rules
Table that connects all of the above tables.
CREATE TABLE EPR (
`holding_id` bigint(30) NOT NULL AUTO_INCREMENT,
`event_id` bigint(20) unsigned NOT NULL,
`place_id` bigint(20) unsigned NOT NULL,
`rule_id` bigint(20) unsigned NOT NULL,
PRIMARY KEY (`holding_id`),
UNIQUE KEY `compound` (`place_id`,`event_id`,`rule_id`),
KEY `FK_Places-Company Events-Rules_Events_event_id` (`event_id`),
KEY `FK_Places-Company Events-Rules_Places_place_id` (`place_id`),
KEY `FK_Places-Company Events-Rules_Rules_rule_id` (`rule_id`),
CONSTRAINT `FK_Places-Company Events-Rules_Events_event_id`
FOREIGN KEY (`event_id`) REFERENCES `Events` (`event_id`) ON DELETE CASCADE
ON UPDATE CASCADE,
CONSTRAINT `FK_Places-Company Events-Rules_Rules_rule_id`
FOREIGN KEY (`rule_id`) REFERENCES `Rules` (`rule_id`) ON DELETE CASCADE ON
UPDATE CASCADE,
CONSTRAINT `fk_place_id_pcerc` FOREIGN KEY (`place_id`)
REFERENCES `Places` (`place_id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB CHARSET=latin1;
Stored functions
There are two stored functions, GETBEGINS and GETENDS. Parameters: rule_id, curstamp, timestamp. curstamp is the unix timestamp of the beginning of the current day; timestamp is the current unix timestamp.
These functions work as follows. For each rule they return its beginning (begins) and its ending (ends). If the rule is not repeatable, they return the start_date and end_date stored in the Rules table. If the rule is repeatable, they construct begins and ends from the closest non-NULL day_start/day_end in the RegularRules table. For instance, suppose an event has two rules. The first is not repeatable: it begins at start_date and ends at end_date. The second is repeatable and has just two non-NULL fields: mon_start = 36000 and mon_end = 64800. GETBEGINS turns mon_start into a unix timestamp based on the current unix timestamp and the unix timestamp of the beginning of the current day; GETENDS works similarly. Code of these functions will be provided if necessary.
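The actual function code was not posted; below is a minimal sketch of what GETBEGINS might look like, assuming the behaviour described above. The parameter names, the Monday-only shortcut, and the final RETURN expression are illustrative simplifications, not the poster's code.
DELIMITER $$
CREATE FUNCTION GETBEGINS(p_rule_id BIGINT UNSIGNED, p_curstamp INT, p_timestamp INT)
    RETURNS INT
    READS SQL DATA
BEGIN
    DECLARE v_regular TINYINT;
    DECLARE v_start INT;
    DECLARE v_mon_start INT;

    SELECT regular, start_date INTO v_regular, v_start
    FROM Rules
    WHERE rule_id = p_rule_id;

    -- Non-repeatable rule: just return the stored start_date.
    IF v_regular = 0 THEN
        RETURN v_start;
    END IF;

    -- Repeatable rule: only the Monday case is shown; the real function
    -- would examine all seven day_start columns and pick the closest
    -- non-NULL one relative to p_timestamp.
    SELECT mon_start INTO v_mon_start
    FROM RegularRules
    WHERE rule_id = p_rule_id;

    -- Simplified: assumes the matching day is the current day.
    RETURN p_curstamp + v_mon_start;
END$$
DELIMITER ;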
Problematic query
This query should return ongoing events that are closest both geographically and chronologically. Places should be distinct: for each place the query should return the chronologically closest event, and at the end sort the resulting rows by time and distance with some coefficients (I think the sorting part will be moved to a server-side language like PHP; if you have suggestions about this sorting, I am open to any solution). For example, there are 5 movies in 10 cinemas nearby, and each cinema has 100 schedules. The query should return, for each cinema, the chronologically closest movie, and then sort movies and cinemas by the two values, time and distance.
Intended query
latpoint, longpoint, r - the coordinates and radius passed to the script,
curstamp - unix timestamp of the beginning of the current day,
timestamp - current unix timestamp
SELECT
epr.event_id,
epr.place_id,
epr.rule_id,
(6371 * ACOS(COS(RADIANS(latpoint)) * COS(RADIANS(latitude)) *
COS(RADIANS(longitude) - RADIANS(longpoint)) + SIN(RADIANS(latpoint)) *
SIN(RADIANS(latitude)))) AS distance,
p.latitude,
p.longitude,
GETBEGINS(r.rule_id, curstamp, timestamp) AS begins,
GETENDS(r.rule_id, curstamp, timestamp) AS ends,
MIN(ABS(GETBEGINS(r.rule_id, curstamp, timestamp) - timestamp)) AS
time_min
FROM
Events e
INNER JOIN
EPR epr ON e.event_id = epr.event_id
INNER JOIN
Places p ON epr.place_id = p.place_id
INNER JOIN
Rules r ON epr.rule_id = r.rule_id
WHERE
r.end_date >= timestamp
AND latitude BETWEEN latpoint - (r / 111.045) AND latpoint + (r /
111.045)
AND longitude BETWEEN longpoint - (r / (111.045 *
COS(RADIANS(latpoint)))) AND longpoint + (r / (111.045 *
COS(RADIANS(latpoint))))
AND e.isPublic = 1
GROUP BY epr.place_id
As stated in the title, this query mixes returned values. To be more specific, it matches the wrong rule_id, begins, and ends to the place_id group.
Moreover, this query performs quite poorly. Table sizes: Events - 3000 rows, Places - 8000 rows, Rules - 18000 rows, EPR - 15000 rows. The query takes approximately 1.8 seconds with an index hint (USE INDEX (compound)) and 1.2 seconds without one. Without the index hint, the query does a full table scan.
I have read the official MySQL docs regarding this subject. However, their solution is not suitable because of the user-calculated values (GETBEGINS and GETENDS).
Question
The query in the Intended query section suffers from the groupwise-min problem because of the way MySQL handles GROUP BY. A possible solution is to make GETBEGINS and GETENDS user-defined aggregate functions, so that MySQL might return the appropriate result. Is this solution logical? Will making GETBEGINS and GETENDS aggregate functions help? Will MySQL return the appropriate data in that case?
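For reference, the standard groupwise-min rewrite that needs no aggregate UDFs joins the detail rows back to a per-group minimum. A sketch against this schema (the geographic, isPublic and end_date filters are omitted for brevity, and the GETBEGINS calls are repeated, roughly doubling their cost):
SELECT t.place_id, t.event_id, t.rule_id
FROM
    ( SELECT epr.place_id, epr.event_id, epr.rule_id,
             ABS(GETBEGINS(epr.rule_id, curstamp, timestamp) - timestamp) AS time_diff
      FROM EPR epr ) AS t
INNER JOIN
    ( SELECT epr.place_id,
             MIN(ABS(GETBEGINS(epr.rule_id, curstamp, timestamp) - timestamp)) AS time_min
      FROM EPR epr
      GROUP BY epr.place_id ) AS m
    ON m.place_id = t.place_id
   AND t.time_diff = m.time_min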
Conclusion
Comments about provided solutions, new solutions, comments about indexing and about database architecture are appreciated and welcomed.

The groupwise-max trick is not guaranteed to work. In fact, MariaDB broke it, but provided a setting to get it back. This is what I am referring to:
SELECT *
FROM
( SELECT ... ORDER BY ... )
GROUP BY ...
where you want the first (or last) in each group from the inner query. The problem is that SQL is free to optimize away that intent.
The groupwise max code in the docs is terribly inefficient.
To speed up the query, a likely bit of help is to isolate the Rules or Places part of the WHERE clause and make that into a subquery which returns just the PRIMARY KEY of the corresponding table. Then put that into a JOIN with all the tables (including a JOIN back to the same table). You already have a "covering index" for that subquery so that it can be "Using index" (in the jargon used by EXPLAIN).
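A sketch of that rewrite, using the Places filter as the inner subquery (the remaining SELECT list and WHERE conditions are carried over from the question's query; latpoint, longpoint, r, and timestamp are the same placeholders):
SELECT epr.event_id, epr.place_id, epr.rule_id, p.latitude, p.longitude
FROM
    ( SELECT place_id
      FROM Places
      WHERE latitude BETWEEN latpoint - (r / 111.045)
                         AND latpoint + (r / 111.045)
        AND longitude BETWEEN longpoint - (r / (111.045 * COS(RADIANS(latpoint))))
                          AND longpoint + (r / (111.045 * COS(RADIANS(latpoint))))
    ) AS nearby
INNER JOIN Places p  ON p.place_id = nearby.place_id
INNER JOIN EPR epr   ON epr.place_id = nearby.place_id
INNER JOIN Events e  ON e.event_id = epr.event_id AND e.isPublic = 1
INNER JOIN Rules r   ON r.rule_id = epr.rule_id AND r.end_date >= timestamp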
Is innodb_buffer_pool_size set to about 70% of available RAM?
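To check it (the value is set in my.cnf and, before MySQL 5.7.5, requires a server restart to change):
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
-- in my.cnf, e.g. for a machine with 8 GB of RAM:
-- [mysqld]
-- innodb_buffer_pool_size = 5G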
BIGINT takes 8 bytes; you could probably live with MEDIUMINT UNSIGNED (0..16M). Smaller --> more cacheable --> less I/O --> faster.
The pair of DOUBLEs for lat/lng take 16 bytes. A FLOAT pair would take 8 bytes and have 6-foot / 2m resolution. Or DECIMAL(6,4) for latitude and (7,4) for longitude for 7 bytes and a 52 foot / 16m resolution. Good enough for "stores", especially since you are using a 'square' instead of a 'circle' for distance.
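A sketch of that change for Places (the id columns could likewise become MEDIUMINT UNSIGNED, but then EPR.place_id and its foreign key would have to be changed in the same step):
ALTER TABLE Places
    MODIFY `latitude`  DECIMAL(6,4) NOT NULL,
    MODIFY `longitude` DECIMAL(7,4) NOT NULL;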
Code for "find the nearest ..." is hard to optimize. Here is the best I have come up with: http://mysql.rjweb.org/doc.php/latlng

Related

Django slow inner join on a table with more than 10 million records

I am using MySQL with Django. I am trying to count the number of visitor_page rows for a specific dealer in a certain period of time.
I will share the raw SQL query that I obtained from the Django debug toolbar:
SELECT COUNT(*) AS `__count`
FROM `visitor_page`
INNER JOIN `dealer_visitors`
ON (`visitor_page`.`dealer_visitor_id` = `dealer_visitors`.`id`)
WHERE (`visitor_page`.`date_time` BETWEEN '2021-02-01 05:51:00'
AND '2021-03-21 05:50:00'
AND `dealer_visitors`.`dealer_id` = 15)
The issue is that I have more than 13 million records in the visitor_page table and about 1.5 million records in the dealer_visitors table. I have already indexed date_time. I am thinking of using a materialized view, but before attempting that, I would really appreciate suggestions on how I could improve this query.
visitor_page schema:
CREATE TABLE `visitor_page` (
`id` int NOT NULL AUTO_INCREMENT,
`date_time` datetime(6) DEFAULT NULL,
`added_at` datetime(6) DEFAULT NULL,
`updated_at` datetime(6) DEFAULT NULL,
`page_id` int NOT NULL,
`dealer_visitor_id` int NOT NULL,
PRIMARY KEY (`id`),
KEY `visitor_page_page_id_246babdf_fk_web_page_id` (`page_id`),
KEY `visitor_page_dealer_visitor_id_e2dddea2_fk_dealer_visitors_id` (`dealer_visitor_id`),
KEY `visitor_page_date_time_06e9e9f5` (`date_time`),
CONSTRAINT `visitor_page_dealer_visitor_id_e2dddea2_fk_dealer_visitors_id` FOREIGN KEY (`dealer_visitor_id`) REFERENCES `dealer_visitors` (`id`),
CONSTRAINT `visitor_page_page_id_246babdf_fk_web_page_id` FOREIGN KEY (`page_id`) REFERENCES `web_page` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=13626649 DEFAULT CHARSET=latin1;
dealer_visitors schema:
CREATE TABLE `dealer_visitors` (
`id` int NOT NULL AUTO_INCREMENT,
`visit_date` datetime(6) DEFAULT NULL,
`added_at` datetime(6) DEFAULT NULL,
`updated_at` datetime(6) DEFAULT NULL,
`dealer_id` int NOT NULL,
`visitor_id` int NOT NULL,
`type` int DEFAULT NULL,
`notes` longtext,
`location` varchar(100) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `dealer_visitors_dealer_id_306e2202_fk_dealer_id` (`dealer_id`),
KEY `dealer_visitors_visitor_id_27ae498e_fk_visitor_id` (`visitor_id`),
KEY `dealer_visitors_type_af0f7d79` (`type`),
KEY `dealer_visitors_visit_date_f2b138c9` (`visit_date`),
CONSTRAINT `dealer_visitors_dealer_id_306e2202_fk_dealer_id` FOREIGN KEY (`dealer_id`) REFERENCES `dealer` (`id`),
CONSTRAINT `dealer_visitors_visitor_id_27ae498e_fk_visitor_id` FOREIGN KEY (`visitor_id`) REFERENCES `visitor` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1524478 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
EXPLAIN ANALYZE output for the query was posted as an image and is not reproduced here.
For this query:
SELECT COUNT(*) AS `__count`
FROM visitor_page vp JOIN
dealer_visitors dv
ON vp.dealer_visitor_id = dv.id
WHERE vp.date_time BETWEEN '2021-02-01 05:51:00' AND '2021-03-21 05:50:00' AND
dv.dealer_id = 15;
The best indexes are on dealer_visitors(dealer_id) and visitor_page(dealer_visitor_id, date_time).
An index only on date helps a bit. But you are retrieving a month's worth of data and that might be a lot of data to process. Having dealer_id as the first column in the index will restrict the data to only the rows for that dealer in that time frame.
Depending on the distribution of the data, the Optimizer might pick one of the tables to start with, or pick the other. So, let's provide optimal indexes for each case:
ON `visitor_page`.`dealer_visitor_id` = `dealer_visitors`.`id`
WHERE `visitor_page`.`date_time` BETWEEN ...
AND `dealer_visitors`.`dealer_id` = 15
Starting with visitor_page:
visitor_page: INDEX(date_time) -- (already exists)
dealer_visitors: (already has PRIMARY KEY(id))
Starting with dealer_visitors:
dealer_visitors: INDEX(dealer_id) -- (already exists)
visitor_page: INDEX(dealer_visitor_id, date_time) -- in this order
and drop visitor_page_dealer_visitor_id_e2dddea2_fk_dealer_visitors_id as now being redundant (the new composite index also starts with dealer_visitor_id, so it can back the foreign key).
The net is to add one index and drop one index.
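In DDL terms (the new index name is made up):
ALTER TABLE visitor_page
    ADD INDEX visitor_page_dv_id_date_time (dealer_visitor_id, date_time),
    DROP INDEX visitor_page_dealer_visitor_id_e2dddea2_fk_dealer_visitors_id;
-- The new index starts with dealer_visitor_id, so it can serve the
-- existing foreign key once the old single-column index is gone.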
Materialized view -- It is often best for Data Warehouse reports to build and incrementally maintain a "summary table" (a "materialized view"). The very odd date range (1 month + 20 days - 61 seconds) makes this clumsy to do. Typically it is handy to make the table based on whole days. If you can shift to daily (or hourly), then see http://mysql.rjweb.org/doc.php/summarytables
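A sketch of such a summary table at daily granularity (all names are made up; see the link for maintenance patterns):
CREATE TABLE page_counts_daily (
    dealer_id INT NOT NULL,
    dy DATE NOT NULL,
    ct INT UNSIGNED NOT NULL,
    PRIMARY KEY (dealer_id, dy)
);
-- Run once per day for the day just finished:
INSERT INTO page_counts_daily (dealer_id, dy, ct)
SELECT dv.dealer_id, DATE(vp.date_time), COUNT(*)
FROM visitor_page vp
INNER JOIN dealer_visitors dv ON vp.dealer_visitor_id = dv.id
WHERE vp.date_time >= CURDATE() - INTERVAL 1 DAY
  AND vp.date_time <  CURDATE()
GROUP BY dv.dealer_id, DATE(vp.date_time);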
Something else to check: How much RAM do you have? What does SHOW VARIABLES LIKE 'innodb_buffer_pool_size'; say?
I see that the tables have different charset/collation. This is not a problem for the query in question, but if you have other queries that JOIN on VARCHARs, check that they use the same collation.

MySQL composite index effect on joins

I have the following SQL query (DB is MySQL 5):
select
event.full_session_id,
DATE(min(event.date)),
event_exe.user_id,
COUNT(DISTINCT event_pat.user_id)
FROM
event AS event
JOIN event_participant AS event_pat ON
event.pat_id = event_pat.id
JOIN event_participant AS event_exe on
event.exe_id = event_exe.id
WHERE
event_pat.user_id <> event_exe.user_id
GROUP BY
event.full_session_id;
"SHOW CREATE TABLE event":
CREATE TABLE `event` (
`id` int(12) NOT NULL AUTO_INCREMENT,
`date` datetime NOT NULL,
`session_id` varchar(64) DEFAULT NULL,
`full_session_id` varchar(72) DEFAULT NULL,
`pat_id` int(12) DEFAULT NULL,
`exe_id` int(12) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `SESSION_IDX` (`full_session_id`),
KEY `PAT_ID_IDX` (`pat_id`),
KEY `DATE_IDX` (`date`),
KEY `SESSLOGPATEXEC_IDX` (`full_session_id`,`date`,`pat_id`,`exe_id`)
) ENGINE=MyISAM AUTO_INCREMENT=371955 DEFAULT CHARSET=utf8
"SHOW CREATE TABLE event_participant":
CREATE TABLE `event_participant` (
`id` int(12) NOT NULL AUTO_INCREMENT,
`user_id` varchar(64) NOT NULL,
`alt_user_id` varchar(64) NOT NULL,
`username` varchar(128) NOT NULL,
`usertype` varchar(32) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `ALL_UNQ` (`user_id`,`alt_user_id`,`username`,`usertype`),
KEY `USER_ID_IDX` (`user_id`)
) ENGINE=MyISAM AUTO_INCREMENT=5397 DEFAULT CHARSET=utf8
Also, the query itself seems ugly, but this is legacy code on a production system, so we are not expected to change it (at least for now).
The problem is that there are around 36 million records in the event table (in the production system), so there have been frequent crashes of the DB machine due to "Using temporary; Using filesort" processing (they provided these EXPLAIN outputs; unfortunately, I don't have them right now. I'll try to add them to this post later.)
The customer asks for a "quick fix" by adding indices. Currently we have indices on full_session_id, pat_id, date (separately) on event and user_id on event_participant.
Thus I'm thinking of creating a composite index (pat_id, exe_id, full_session_id, date) on event - this index comprises the fields in the joins (the JOIN conditions being the equivalent of a WHERE here), then the GROUP BY, then the aggregate (MIN) parts. The proposed index in DDL form is sketched below.
This is just an idea, because we currently don't have that kind of data volume to test with, so we try the best we can first.
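In DDL form (the index name is made up):
ALTER TABLE event ADD INDEX PAT_EXE_SESS_DATE_IDX (pat_id, exe_id, full_session_id, `date`);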
My question is:
Could the index above help performance? (The effect is quite confusing, because I have found two really contrasting results: https://dba.stackexchange.com/questions/158385/compound-index-on-inner-join-table
versus Separate Join clause in a Composite Index, where the latter suggests that a composite index on joins won't work and the former that it will.)
Does this path (adding indices) have any hope? Or should we forget it and just try to optimize the query instead?
Thanks in advance for your help :)
Update:
I have updated the full table description for the two related tables.
MySQL version is 5.1.69. But I think we don't need to worry about the ambiguous-data issue mentioned in the comments, because it seems there won't be ambiguity in our data. Specifically, for each full_session_id there is only one event_exe.user_id returned (it's just business logic in the application).
So, what do you think about my 2 questions ?

Select Exchange Rate based on Currency and Date [closed]

Closed 3 years ago. This question needs to be more focused and is not currently accepting answers.
I've looked around for a solution to this, but I was unable to find anything even similar to my case.
I need to select the exchange rate, based on the date a product was purchased.
Let me try and explain.
I have a table with Currencies:
CREATE TABLE `tblCurrencies` (
`CurrID` int(11) NOT NULL AUTO_INCREMENT,
`CurencySymbol` varchar(1) DEFAULT NULL,
`CurrencyCode` varchar(3) DEFAULT NULL,
`CurrencyDescription` varchar(100) DEFAULT NULL,
PRIMARY KEY (`CurrID`)
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=utf8;
A table with Exchange Rates:
CREATE TABLE `tblExchRates` (
`ExcID` int(11) NOT NULL AUTO_INCREMENT,
`CurrKey` int(11) DEFAULT NULL,
`Date` date DEFAULT NULL,
`Exchange` decimal(11,3) DEFAULT NULL,
PRIMARY KEY (`ExcID`),
KEY `CurrKey` (`CurrKey`),
CONSTRAINT `tblExchRates_ibfk_1` FOREIGN KEY (`CurrKey`) REFERENCES `tblCurrencies` (`CurrID`) ON DELETE RESTRICT ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=111 DEFAULT CHARSET=utf8;
And a table with Products (note that my products are listed as numbers in the table, which is OK in my case):
CREATE TABLE `tblProducts` (
`ProductID` int(11) NOT NULL AUTO_INCREMENT,
`Contract` int(11) DEFAULT NULL,
`Product` int(11) DEFAULT NULL,
`Type` varchar(100) DEFAULT NULL,
`Currency` int(11) DEFAULT NULL,
`Amount` decimal(10,0) DEFAULT NULL,
`PurchaseDate` datetime DEFAULT NULL,
PRIMARY KEY (`ProductID`),
KEY `Contract` (`Contract`),
KEY `Currency` (`Currency`),
CONSTRAINT `tblShopCart_ibfk_2` FOREIGN KEY (`Currency`) REFERENCES `tblCurrencies` (`CurrID`) ON DELETE RESTRICT ON UPDATE CASCADE,
CONSTRAINT `tblShopCart_ibfk_1` FOREIGN KEY (`Contract`) REFERENCES `tblContracts` (`ContractID`) ON DELETE RESTRICT ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=3155 DEFAULT CHARSET=utf8;
In the Exchange Rates table, as an example, values are set like this:
CurrKey Date Exchange
1 15-01-2017 0.850
1 31-01-2017 0.856
1 02-02-2018 0.918
1 18-02-2018 0.905
2 04-02-2018 1.765
2 14-02-2018 1.755
And so on...
I want a query that selects a unique exchange rate based on the date a product was purchased and the currency it was purchased in.
In other words, if I have a product that was purchased on 07-02-2018, the query has to select the exchange rate that is valid for the date range matching the purchase date and its currency. In this example, the correct exchange rate for a product purchased on 07-02-2018 with a CurrKey of 1 would be 0.918.
Please note that exchange rates are set on random dates (as per example above).
I managed to make this query, but it is not precise, as it sometimes returns two or more results (due to the 10-day range I set), whereas I only need one result:
SELECT
    tblExchRates.Exchange
FROM
    tblProducts
    INNER JOIN tblCurrencies ON tblCurrencies.CurrID = tblProducts.Currency
    INNER JOIN tblExchRates ON tblExchRates.CurrKey = tblCurrencies.CurrID
WHERE
    tblCurrencies.CurrencyCode = "EUR" AND
    tblExchRates.Date BETWEEN (tblProducts.PurchaseDate - INTERVAL 10 DAY) AND (tblProducts.PurchaseDate + INTERVAL 10 DAY)
For a fairly simple solution you can do:
SELECT
    p.*,
    (SELECT er.Exchange
     FROM tblExchRates AS er
     INNER JOIN tblCurrencies AS c ON er.CurrKey = c.CurrID AND c.CurrencyCode = 'EUR'
     WHERE er.Date <= p.PurchaseDate
     ORDER BY er.Date DESC
     LIMIT 1) AS ExchangeRate
FROM
    tblProducts AS p
Another option: if you have control over the schema, then changing your exchange-rates table to have a DateFrom and DateTo rather than just a single date would mean you can simply find the correct exchange rate using BETWEEN.
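A sketch of that approach (DateFrom/DateTo are hypothetical columns that would have to be maintained whenever a new rate is inserted):
ALTER TABLE tblExchRates
    ADD COLUMN DateFrom date DEFAULT NULL,
    ADD COLUMN DateTo date DEFAULT NULL;

SELECT p.*, er.Exchange
FROM tblProducts AS p
INNER JOIN tblExchRates AS er
    ON er.CurrKey = p.Currency
    AND DATE(p.PurchaseDate) BETWEEN er.DateFrom AND er.DateTo;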
I am a beginner myself, so no guarantees on correctness. I believe you have to use an application programming language along with SQL, for example PHP. Still, I will outline the basic steps I would take.
1. Assign the purchase currency ID and purchase date to variables using a simple SELECT statement. Say the ID goes into targetID and the date into targetDate.
2. SELECT MAX(Date) FROM tblExchRates WHERE Date <= targetDate AND CurrKey = targetID; //Select the most recent rate date on or before the purchase date among the matching currency IDs. Assign the result of this statement to another variable, say rateDate.
3. SELECT Exchange FROM tblExchRates WHERE CurrKey = targetID AND Date = rateDate; //Find the exchange rate for the selected date.
Note that there are many ways to do this. For example, you could use table JOINs (refer to this link: https://www.w3schools.com/sql/sql_join.asp) or select columns from different tables in just one SQL statement (refer to this Stack Overflow question: MySQL Select all columns from one table and some from another table). Last, you can use SQL to create variables (refer to this question: Set the variable result, from query) and then perform operations, as sketched below.
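The three steps above, written with MySQL session variables (a sketch; ProductID 123 is a placeholder):
-- step 1
SELECT @targetID := Currency, @targetDate := PurchaseDate
FROM tblProducts WHERE ProductID = 123;
-- step 2
SELECT @rateDate := MAX(`Date`)
FROM tblExchRates
WHERE `Date` <= @targetDate AND CurrKey = @targetID;
-- step 3
SELECT Exchange
FROM tblExchRates
WHERE CurrKey = @targetID AND `Date` = @rateDate;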

How to optimize slow query with many joins [closed]

Closed 4 years ago. This question needs to be more focused and is not currently accepting answers.
My situation:
the query searches around 90,000 vehicles
the query takes long each time
I already have indexes on all the fields being JOINed.
How can I optimise it?
Here is the query:
SELECT vehicles.make_id,
vehicles.fuel_id,
vehicles.body_id,
vehicles.transmission_id,
vehicles.colour_id,
vehicles.mileage,
vehicles.vehicle_year,
vehicles.engine_size,
vehicles.trade_or_private,
vehicles.doors,
vehicles.model_id,
Round(3959 * Acos(Cos(Radians(51.465436)) *
Cos(Radians(vehicles.gps_lat)) *
Cos(
Radians(vehicles.gps_lon) - Radians(
-0.296482)) +
Sin(
Radians(51.465436)) * Sin(
Radians(vehicles.gps_lat)))) AS distance
FROM vehicles
INNER JOIN vehicles_makes
ON vehicles.make_id = vehicles_makes.id
LEFT JOIN vehicles_models
ON vehicles.model_id = vehicles_models.id
LEFT JOIN vehicles_fuel
ON vehicles.fuel_id = vehicles_fuel.id
LEFT JOIN vehicles_transmissions
ON vehicles.transmission_id = vehicles_transmissions.id
LEFT JOIN vehicles_axles
ON vehicles.axle_id = vehicles_axles.id
LEFT JOIN vehicles_sub_years
ON vehicles.sub_year_id = vehicles_sub_years.id
INNER JOIN members
ON vehicles.member_id = members.id
LEFT JOIN vehicles_categories
ON vehicles.category_id = vehicles_categories.id
WHERE vehicles.status = 1
AND vehicles.date_from < 1330349235
AND vehicles.date_to > 1330349235
AND vehicles.type_id = 1
AND ( vehicles.price >= 0
AND vehicles.price <= 1000000 )
Here is the vehicle table schema:
CREATE TABLE IF NOT EXISTS `vehicles` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`number_plate` varchar(100) NOT NULL,
`type_id` int(11) NOT NULL,
`make_id` int(11) NOT NULL,
`model_id` int(11) NOT NULL,
`model_sub_type` varchar(250) NOT NULL,
`engine_size` decimal(12,1) NOT NULL,
`vehicle_year` int(11) NOT NULL,
`sub_year_id` int(11) NOT NULL,
`mileage` int(11) NOT NULL,
`fuel_id` int(11) NOT NULL,
`transmission_id` int(11) NOT NULL,
`price` decimal(12,2) NOT NULL,
`trade_or_private` tinyint(4) NOT NULL,
`postcode` varchar(25) NOT NULL,
`gps_lat` varchar(50) NOT NULL,
`gps_lon` varchar(50) NOT NULL,
`img1` varchar(100) NOT NULL,
`img2` varchar(100) NOT NULL,
`img3` varchar(100) NOT NULL,
`img4` varchar(100) NOT NULL,
`img5` varchar(100) NOT NULL,
`img6` varchar(100) NOT NULL,
`img7` varchar(100) NOT NULL,
`img8` varchar(100) NOT NULL,
`img9` varchar(100) NOT NULL,
`img10` varchar(100) NOT NULL,
`is_featured` tinyint(4) NOT NULL,
`body_id` int(11) NOT NULL,
`colour_id` int(11) NOT NULL,
`doors` tinyint(4) NOT NULL,
`axle_id` int(11) NOT NULL,
`category_id` int(11) NOT NULL,
`contents` text NOT NULL,
`date_created` int(11) NOT NULL,
`date_edited` int(11) NOT NULL,
`date_from` int(11) NOT NULL,
`date_to` int(11) NOT NULL,
`member_id` int(11) NOT NULL,
`inactive_id` int(11) NOT NULL,
`status` tinyint(4) NOT NULL,
PRIMARY KEY (`id`),
KEY `type_id` (`type_id`),
KEY `make_id` (`make_id`),
KEY `model_id` (`model_id`),
KEY `fuel_id` (`fuel_id`),
KEY `transmission_id` (`transmission_id`),
KEY `body_id` (`body_id`),
KEY `colour_id` (`colour_id`),
KEY `axle_id` (`axle_id`),
KEY `category_id` (`category_id`),
KEY `vehicle_year` (`vehicle_year`),
KEY `mileage` (`mileage`),
KEY `status` (`status`),
KEY `date_from` (`date_from`),
KEY `date_to` (`date_to`),
KEY `trade_or_private` (`trade_or_private`),
KEY `doors` (`doors`),
KEY `price` (`price`),
KEY `engine_size` (`engine_size`),
KEY `sub_year_id` (`sub_year_id`),
KEY `member_id` (`member_id`),
KEY `date_created` (`date_created`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=136237 ;
The EXPLAIN:
1 SIMPLE vehicles ref type_id,make_id,status,date_from,date_to,price,mem... type_id 4 const 85695 Using where
1 SIMPLE members index PRIMARY PRIMARY 4 NULL 3 Using where; Using index; Using join buffer
1 SIMPLE vehicles_makes eq_ref PRIMARY PRIMARY 4 tvs.vehicles.make_id 1 Using index
1 SIMPLE vehicles_models eq_ref PRIMARY PRIMARY 4 tvs.vehicles.model_id 1 Using index
1 SIMPLE vehicles_fuel eq_ref PRIMARY PRIMARY 4 tvs.vehicles.fuel_id 1 Using index
1 SIMPLE vehicles_transmissions eq_ref PRIMARY PRIMARY 4 tvs.vehicles.transmission_id 1 Using index
1 SIMPLE vehicles_axles eq_ref PRIMARY PRIMARY 4 tvs.vehicles.axle_id 1 Using index
1 SIMPLE vehicles_sub_years eq_ref PRIMARY PRIMARY 4 tvs.vehicles.sub_year_id 1 Using index
1 SIMPLE vehicles_categories eq_ref PRIMARY PRIMARY 4 tvs.vehicles.category_id 1 Using index
Improving the WHERE clause
Your EXPLAIN shows that MySQL is only utilizing one index (type_id) for selecting the rows that match the WHERE clause, even though you have multiple criteria in the clause.
To be able to utilize an index for all of the criteria in the WHERE clause, and to reduce the size of the result set as quickly as possible, add a multi-column index on the following columns on the vehicles table:
(status, date_from, date_to, type_id, price)
The columns should be in order of highest cardinality to least.
For example, vehicles.date_from is likely to have more distinct values than status, so put the date_from column before status, like this:
(date_from, date_to, price, type_id, status)
This should reduce the rows returned in the first part of the query execution, and should be demonstrated with a lower row count on the first line of the EXPLAIN result.
You will also notice that MySQL will use the multi-column index for the WHERE in the EXPLAIN result. If, by chance, it doesn't, you should hint or force the multi-column index.
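In DDL form, using the second column ordering suggested above (the index name is made up):
ALTER TABLE vehicles
    ADD INDEX vehicles_where_idx (date_from, date_to, price, type_id, status);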
Removing the unnecessary JOINs
It doesn't appear that you are using any fields in any of the joined tables, so remove the joins. This will remove all of the additional work of the query, and get you down to one, simple execution plan (one line in the EXPLAIN result).
Each JOINed table causes an additional lookup per row of the result set. So, if the WHERE clause selects 5,000 rows from vehicles, since you have 8 joins to vehicles, you will have 5,000 * 8 = 40,000 lookups. That's a lot to ask from your database server.
Instead of the expensive calculation of the precise distance for all of the rows, use a bounding box and calculate the exact distance only for rows inside the box.
The simplest possible example is to calculate the min/max longitude and latitude that interest you and add them to the WHERE clause. This way the distance will be calculated only for a subset of rows.
WHERE
    vehicles.gps_lat > min_lat AND vehicles.gps_lat < max_lat AND
    vehicles.gps_lon > min_lon AND vehicles.gps_lon < max_lon
For more complex solutions see:
MySQL spatial extensions
How to use MySQL spatial extensions
https://stackoverflow.com/a/5237509/342473
Is your SQL faster without this?
Round(3959 * Acos(Cos(Radians(51.465436)) *
Cos(Radians(vehicles.gps_lat)) *
Cos(Radians(vehicles.gps_lon) -
Radians(-0.296482)) +
Sin(Radians(51.465436)) *
Sin(Radians(vehicles.gps_lat)))) AS distance
Performing this math for every row is very expensive.
Maybe you should consider a materialized view that pre-calculates the distance, so you can select from that view. Depending on how dynamic your data is, you may not have to refresh it too often.
To be a little more specific than @Randy about indexes, I believe his intention was a COMPOUND index to take advantage of your query criteria... one index built on a MINIMUM of ...
( status, type_id, date_from )
but it could be extended to include date_to and price too; I don't know how much an index at that granular level might actually help:
( status, type_id, date_from, date_to, price )
EDIT per Comments
You shouldn't need all those individual indexes... yes, keep the primary key by itself. For the others, you should have compound indexes based on what your common query criteria might be, and remove the rest; the engine might get confused about which index is best suited for the query. If you know you are always looking for a certain status, type and date (assuming vehicle searches), make that one index. If a query looks for such information but also for prices within that criteria, it will already be very close with the few indexed records that qualify, and will fly through the price check as just an extra criterion.
If you offer queries like "only automatic vs manual transmission regardless of year/make", then yes, that could be an index of its own. However, if you would TYPICALLY have some other "common" criteria, tack that on as a secondary column that MAY be utilized in the query. Ex: if you look for manual transmissions that are 2-door vs 4-door, have your index on (transmission_id, category_id).
Again, you want whatever will help narrow down the field of criteria based on some "minimum" condition. If you tack an extra column onto the index that might "commonly" be applied, that should only help performance.
To clarify this as an answer: if you do not already have these indexes, you should consider adding them
do you also have indexes on these:
vehicles.status
vehicles.date_from
vehicles.date_to
vehicles.type_id
vehicles.price

Optimising a slow MySQL query

I have a MySQL query as follows:
SELECT KeywordText, SUM(Frequency) AS Frequency FROM Keyword, Keyword_Polling_Frequency_Index
WHERE Keyword.KeywordText
IN ('deal', 'obama' and other keywords...)
AND RSSFeedNo IN (106, 107 and other RSS feeds)
AND PollingDateTime
BETWEEN '2011-10-28 13:00:00' AND '2011-10-28 13:59:00'
AND Keyword.KeywordNo = Keyword_Polling_Frequency_Index.KeywordNo
GROUP BY Keyword.KeywordText
ORDER BY Keyword.KeywordText ASC
The query is used by an hourly batch program involving two tables; it is meant to get the frequencies of a list of keywords from a list of RSS feeds for a given hour. The Keyword_Polling_Frequency_Index table has a composite primary key of (KeywordNo, RSSFeedNo, PollingDateTime). The query joins this table to the Keyword table, which contains the KeywordText; the KeywordText column has a MyISAM FULLTEXT index.
In testing this was found to perform satisfactorily, but it has now started running very slowly and affects the interactive speed of the application's pages. When I check the MySQL logs, I find that MySQL is creating temporary tables.
So, my question is, given that this query has to handle dozens of keywords in dozens of RSS feeds to calculate the frequencies, can anyone suggest an optimisation?
I have thought of breaking the query up by keyword but am not convinced of the practicality of this.
Can anyone help?
I am using MySQL Community Edition 5.x. (An EXTENDED EXPLAIN of a version of this query was posted as an image and is not reproduced here.)
SQL for the tables is as follows:
CREATE TABLE `keyword` (
`KeywordNo` int(10) unsigned NOT NULL AUTO_INCREMENT,
`KeywordText` varchar(64) NOT NULL,
`UserOriginated` enum('TRUE','FALSE') NOT NULL,
`Active` enum('TRUE','FALSE') NOT NULL,
`UserNo` varchar(50) NOT NULL,
`StopWord` enum('TRUE','FALSE') NOT NULL,
`CreatedDate` date NOT NULL,
`CreatedTime` time NOT NULL,
PRIMARY KEY (`KeywordNo`),
FULLTEXT KEY `KEYWORDTEXT` (`KeywordText`)
) ENGINE=MyISAM AUTO_INCREMENT=44047 DEFAULT CHARSET=latin1$$
CREATE TABLE `keyword_polling_frequency_index` (
`KeywordNo` int(10) unsigned NOT NULL,
`RSSFeedNo` int(10) unsigned NOT NULL,
`PollingDateTime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`Frequency` int(10) NOT NULL,
`Active` enum('TRUE','FALSE') NOT NULL,
`UserNo` varchar(50) NOT NULL,
PRIMARY KEY (`KeywordNo`,`RSSFeedNo`,`PollingDateTime`),
KEY `FK_keyword_polling_frequency_index_1` (`UserNo`),
CONSTRAINT `FK_keyword_polling_frequency_index_1` FOREIGN KEY (`UserNo`) REFERENCES `user` (`UserNo`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=latin1$$
As mentioned previously, add an index on the PollingDateTime field as well. This is my suggestion:
SELECT
    K.KeywordText,
    SUM(F.Frequency) AS Frequency
FROM
    Keyword K
    INNER JOIN Keyword_Polling_Frequency_Index F ON K.KeywordNo = F.KeywordNo
WHERE
    EXISTS
    (
        SELECT 1
        FROM Keyword K1
        WHERE
            MATCH (K1.KeywordText) AGAINST ('deal obama "another keyword" yetanother' IN BOOLEAN MODE)
            AND K1.KeywordNo = K.KeywordNo
    )
    AND F.PollingDateTime BETWEEN '2011-10-28 13:00:00' AND '2011-10-28 13:59:00'
    AND F.RSSFeedNo IN (106, 107, 110)
GROUP BY K.KeywordText
ORDER BY K.KeywordText ASC
This will probably reduce the number of records involved in the comparison (the subquery narrows down the keywords first) instead of directly matching the two tables (N x N).
If you don't have any indexes you should create relevant indexes.
The minimum index is on keyword_polling_frequency_index.PollingDateTime
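In DDL form (the index name is made up):
ALTER TABLE keyword_polling_frequency_index
    ADD INDEX POLLING_DATETIME_IDX (PollingDateTime);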