MySQL stops using index when additional constraints are added

Using EXPLAIN reveals that the following query does not use my index, could somebody please explain what is going on?
SELECT u.id AS userId, firstName, profilePhotoId, preferredActivityId, preferredSubActivityId, availabilityType,
3959 * ACOS(COS(radians(requestingUserLat)) * COS(radians(u.latitude)) * COS(radians(u.longitude) - radians(requestingUserLon)) + SIN(radians(requestingUserLat)) * SIN(radians(u.latitude))) AS distanceInMiles
FROM users u
WHERE u.latitude between lat1 and lat2 -- MySQL 5.7 supports Point data type, but it is not indexed in innoDB. I store latitude and longitude as DOUBLE for now
AND u.longitude between lon1 and lon2
AND u.dateOfBirth between maxAge and minAge -- dates are in millis, therefore maxAge will have a smaller value than minAge and so it needs to go first
AND IF(gender is null, TRUE, u.gender = gender)
AND IF(activityType is null, TRUE, u.preferredActivityType = activityType)
AND u.accountState = 'A'
AND u.id != userId
HAVING distanceInMiles < searchRadius ORDER BY distanceInMiles LIMIT pagingStart, pagingLength;
CREATE INDEX `findMatches` ON `users` (`latitude` ASC, `longitude` ASC, `dateOfBirth` ASC) USING BTREE;
The index is not used at all at this stage. To get it to work, I need to comment out a bunch of columns from the SELECT statement, and also remove any unindexed columns from the WHERE clause. The following works:
SELECT u.id AS userId --, firstName, profilePhotoId, preferredActivityId, preferredSubActivityId, availabilityType,
3959 * ACOS(COS(radians(requestingUserLat)) * COS(radians(u.latitude)) * COS(radians(u.longitude) - radians(requestingUserLon)) + SIN(radians(requestingUserLat)) * SIN(radians(u.latitude))) AS distanceInMiles
FROM users u
WHERE u.latitude between lat1 and lat2 -- MySQL 5.7 supports Point data type, but it is not indexed in innoDB. We store latitude and longitude as DOUBLE for now
AND u.longitude between lon1 and lon2
AND u.dateOfBirth between maxAge and minAge -- dates are in millis, therefore maxAge will have a smaller value than minAge and so it needs to go first
-- AND IF(gender is null, TRUE, u.gender = gender)
-- AND IF(activityType is null, TRUE, u.preferredActivityType = activityType)
-- AND u.accountState = 'A'
-- AND u.id != userId
HAVING distanceInMiles < searchRadius ORDER BY distanceInMiles LIMIT pagingStart, pagingLength;
Other things I tried:
I tried creating 3 distinct single-part indexes, in addition to my multi-part index that contains all 3 keys. Based on the docs here, shouldn't the optimizer merge them by creating a UNION of their qualifying rows, further speeding up execution? It isn't doing that; it still selects the multi-part (covering) index.
Any help greatly appreciated!
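As an aside, for anyone checking the math: the 3959 * ACOS(...) expression is the spherical law of cosines, with 3959 being the Earth's radius in miles. A minimal Python equivalent, handy for sanity-checking results outside MySQL (the coordinates below are just sample values):

```python
from math import acos, cos, sin, radians

def distance_miles(lat1, lon1, lat2, lon2):
    """Spherical-law-of-cosines distance, mirroring the SQL expression."""
    return 3959 * acos(
        cos(radians(lat1)) * cos(radians(lat2)) * cos(radians(lon2) - radians(lon1))
        + sin(radians(lat1)) * sin(radians(lat2))
    )

# New York to Los Angeles is roughly 2,450 miles great-circle.
print(round(distance_miles(40.7128, -74.0060, 34.0522, -118.2437)))
```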

This is a little difficult to explain.
The query that uses the index does so because the index is a "covering" index. That is, all the columns in the query are in the index. The only part of the index really being used effectively is the condition on latitude.
Normally a covering index would have only the columns mentioned in the query. However, the primary key is also stored in the index to reference the rows, so I'm guessing that users.id is the primary key on the table. The index is being scanned for valid values of latitude.
The query that does not use the index avoids it for two reasons. First, the conditions on the columns are inequalities: an index seek can use any number of equality conditions but only one inequality, which means the index could only be used effectively for latitude. Second, the additional columns in the query require going to the data pages anyway.
In other words, the optimizer is, in effect, saying: "Why bother going to the index to scan through the index and then scan the data pages? Instead, I can just scan the data pages and get everything all at once."
Your next question is undoubtedly: "But how do I make my query faster?" My suggestion would be to investigate spatial indexes.
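The covering-vs-non-covering distinction described above is easy to reproduce with SQLite's EXPLAIN QUERY PLAN (a different planner than InnoDB's, but the same idea); a minimal sketch, with table and index names borrowed from the question:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        latitude REAL, longitude REAL, dateOfBirth INTEGER,
        firstName TEXT
    );
    CREATE INDEX findMatches ON users (latitude, longitude, dateOfBirth);
""")

def plan(sql):
    # Concatenate the "detail" column of EXPLAIN QUERY PLAN output.
    return " ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

# Only indexed columns (plus the rowid/PK) are referenced: the index covers the query.
covered = plan("SELECT id FROM users WHERE latitude BETWEEN 1 AND 2")

# Pulling firstName forces a lookup into the table itself.
not_covered = plan("SELECT id, firstName FROM users WHERE latitude BETWEEN 1 AND 2")

print(covered)      # mentions COVERING INDEX
print(not_covered)  # no COVERING: firstName must come from the table
```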

Related

What is the most efficient way to know if a MySQL longblob is empty?

I have a MySQL table of around 150,000 rows, and a good half of them have a blob (image) stored in a longblob field. I'm trying to create a query to select rows and include a field that simply indicates whether the longblob (image) exists. Basically:
select ID, address, IF(house_image != '', 1, 0) AS has_image from homes where userid='1234';
That query times out after 300 seconds. If I remove the IF(house_image != '', 1, 0) it completes in less than a second. I've also tried the following, but they all time out.
IF(ISNULL(house_image),0,1) as has_image
LEFT (house_image,1) AS has_image
SUBSTRING(house_image,0,1) AS has_image
I am not a DBA (obviously), but I'm suspecting that the query is selecting the entire longblob to know if it's empty or null.
Is there an efficient way to know if a field is empty?
Thanks for any assistance.
I had a similar problem a long time ago, and the workaround I ended up with was to move all blob/text columns into a separate table (bonus: this design allows multiple images per home). Once you've changed the design and moved the data around, you can do this:
select id, address, (
select 1
from home_images
where home_images.home_id = homes.id
limit 1
) as has_image -- will be 1 or null
from homes
where userid = 1234
PS: I make no guarantees. Depending on the storage engine and row format, the blobs could get stored inline. If that is the case, then reading the data will take much more disk IO than needed even if you're not SELECTing the blob column.
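A minimal SQLite sketch of this design, with invented sample data (separate home_images table plus the correlated subquery from the answer above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE homes (id INTEGER PRIMARY KEY, userid INTEGER, address TEXT);
    CREATE TABLE home_images (
        id INTEGER PRIMARY KEY,
        home_id INTEGER REFERENCES homes(id),
        image BLOB
    );
    CREATE INDEX idx_home_images_home ON home_images (home_id);

    INSERT INTO homes VALUES (1, 1234, '1 Main St'), (2, 1234, '2 Main St');
    INSERT INTO home_images (home_id, image) VALUES (1, x'deadbeef');
""")

rows = con.execute("""
    SELECT id, address, (
        SELECT 1 FROM home_images
        WHERE home_images.home_id = homes.id
        LIMIT 1
    ) AS has_image          -- 1 or NULL, never touching the blob itself
    FROM homes
    WHERE userid = 1234
    ORDER BY id
""").fetchall()

print(rows)  # [(1, '1 Main St', 1), (2, '2 Main St', None)]
```

The subquery only probes the index on home_id, so the blob column is never read.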
It looks to me like you are treating the house_image column as a string when really you should be checking it for NULL.
select ID, address, IF(house_image IS NOT NULL, 1, 0) AS has_image
from homes where userid='1234';
LONGBLOBs can be indexed in MariaDB / MySQL, but the indexes are imperfect: they are so-called prefix indexes, and only consider the first bytes of the BLOB.
Try creating this compound index with a 20-byte prefix on your BLOB.
ALTER TABLE homes ADD INDEX user_image (userid, house_image(20));
Then this subquery will efficiently give you the IDs of rows with empty house_image columns.
SELECT ID
FROM homes
WHERE userid = '1234'
AND (house_image IS NULL OR house_image = '')
The prefix index can satisfy (house_image IS NULL OR house_image = '') directly without inspecting the BLOBs. That saves a whole mess of IO and CPU on your database server.
You can then incorporate your subquery into a main query to get your result.
SELECT h.ID, h.address,
CASE WHEN empty.ID IS NULL THEN 1 ELSE 0 END AS has_image
FROM homes h
LEFT JOIN (
SELECT ID
FROM homes
WHERE userid = '1234'
AND (house_image IS NULL OR house_image = '')
) empty ON h.ID = empty.ID
WHERE h.userid = '1234'
The IS NULL ... LEFT JOIN trick means "any rows that do NOT show up in the subquery have images."

MySQL query optimization with complex index

I have a database used for simple reverse geocoding. The database relies on a table containing latitude, longitude, and place name. Every time a latitude/longitude pair is not present or, better, every time the searched latitude/longitude differs too much from an existing one, I add a new row using the Google Maps reverse-geocoding service.
Below the code to generate the address table:
CREATE TABLE `data_addresses` (
`ID` int(11) NOT NULL COMMENT 'Primary Key',
`LAT` int(11) NOT NULL COMMENT 'Latitude x 10000',
`LNG` int(11) NOT NULL COMMENT 'Longitude x 10000',
`ADDRESS` varchar(128) NOT NULL COMMENT 'Reverse Geocoded Street Address'
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
ALTER TABLE `data_addresses`
ADD PRIMARY KEY (`ID`),
ADD UNIQUE KEY `IDX_ADDRESS_UNIQUE_LATLNG` (`LAT`,`LNG`),
ADD KEY `IDX_ADDRESS_LAT` (`LAT`),
ADD KEY `IDX_ADDRESS_LNG` (`LNG`);
ALTER TABLE `data_addresses`
MODIFY `ID` int(11) NOT NULL AUTO_INCREMENT COMMENT 'Primary Key';
As you can see, the trick is to place two indexes on latitude and longitude. Since latitude and longitude are normally floats, we store their values multiplied by 10000, so each latitude/longitude pair is unique. This implies a resolution of about 50m, which is satisfactory for my needs.
Now the problem: every time I need to know whether a given latitude/longitude (MyLat, MyLon) is already present, I execute the following query:
SELECT `id`, ROUND(SQRT(POW(ABS(`LAT`-ROUND(MyLat*10000)),2)+POW(ABS(`LNG`-ROUND(MyLon*10000)),2))) AS R FROM splc_smarttrk.`data_addresses` ORDER BY R ASC LIMIT 1
This query returns the closest point and also gives me R (the rating): a smaller R means a closer approximation, so let's say that every time I find an R above 10 I need to add a new row to the address table.
Address table at present contains about 615k rows.
The problem is that despite the indexes I have in place, this query is too slow (it takes about 2 seconds on a 2x Xeon server). Below are the results of EXPLAIN:
Can't you optimize this by retrieving a fixed set of nearby latitudes and longitudes, calculating the rating (R) over that fixed set, and picking the smallest rating?
P.S. Not tested; it might contain errors in the sorting, but it may help you on your way.
SELECT
    id
  , ROUND(SQRT(POW(ABS(`LAT`-ROUND([LAT]*10000)),2)+POW(ABS(`LNG`-ROUND([LNG]*10000)),2))) AS R
FROM (
    (SELECT id, LAT, LNG
     FROM data_addresses
     WHERE LAT <= [LAT]
     ORDER BY LAT DESC
     LIMIT 100)
    UNION ALL
    (SELECT id, LAT, LNG
     FROM data_addresses
     WHERE LAT >= [LAT]
     ORDER BY LAT ASC
     LIMIT 100)
    UNION ALL
    (SELECT id, LAT, LNG
     FROM data_addresses
     WHERE LNG <= [LNG]
     ORDER BY LNG DESC
     LIMIT 100)
    UNION ALL
    (SELECT id, LAT, LNG
     FROM data_addresses
     WHERE LNG >= [LNG]
     ORDER BY LNG ASC
     LIMIT 100)
) AS data_addresses_range
ORDER BY
    R ASC
LIMIT 1
Instead of computing the distance (or in addition to), provide a "bounding box". This will be much faster.
Still faster would be the complex code here: mysql.rjweb.org/doc.php/latlng
Once you have UNIQUE KEY IDX_ADDRESS_UNIQUE_LATLNG (LAT, LNG), there is no need for KEY IDX_ADDRESS_LAT (LAT)
*10000 can fit in MEDIUMINT. And it is good to about 16 meters or 52 feet.
Following Raymond Nijland's suggestion, I modified the query as follows:
SELECT `id` AS ID,
ROUND(SQRT(POW(ABS(`LAT`-ROUND(NLat*10000)), 2) +
POW(ABS(`LNG`-ROUND(NLon*10000)), 2))
) AS RT INTO ADDR_ID, RATING
FROM splc_smarttrk.`data_addresses`
WHERE (`LAT` BETWEEN (ROUND(NLat*10000)-R) AND (ROUND(NLat*10000)+R))
AND (`LNG` BETWEEN (ROUND(NLon*10000)-R) AND (ROUND(NLon*10000)+R))
ORDER BY RT ASC
LIMIT 1;
This trick reduces the dataset to about 10 records in the worst-case scenario, so the speed is fairly good despite the ORDER BY clause. In fact, I don't really need to know the distance from the existing point; I just need to know whether that distance is above a given limit (here, whether it falls within a 10x10 rectangle, meaning R=5).
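The bounding-box prefilter is easy to reproduce in miniature. A SQLite sketch with invented sample points; squared distance replaces SQRT/POW (which require a math-enabled SQLite build) and orders identically:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE data_addresses (
        ID INTEGER PRIMARY KEY,
        LAT INTEGER NOT NULL,   -- latitude  x 10000
        LNG INTEGER NOT NULL,   -- longitude x 10000
        ADDRESS TEXT NOT NULL
    );
    CREATE UNIQUE INDEX IDX_ADDRESS_UNIQUE_LATLNG ON data_addresses (LAT, LNG);

    INSERT INTO data_addresses (LAT, LNG, ADDRESS) VALUES
        (457000,  91800, 'close'),
        (457003,  91804, 'closest'),
        (500000, 100000, 'far away');
""")

def nearest(lat, lng, r=5):
    """Nearest stored point within an r-unit box, or None if the box is empty.

    Squared distance orders the same as distance, so the square root
    can be skipped entirely.
    """
    return con.execute("""
        SELECT ID, ADDRESS,
               (LAT - :lat) * (LAT - :lat) + (LNG - :lng) * (LNG - :lng) AS D2
        FROM data_addresses
        WHERE LAT BETWEEN :lat - :r AND :lat + :r
          AND LNG BETWEEN :lng - :r AND :lng + :r
        ORDER BY D2 ASC
        LIMIT 1
    """, {"lat": lat, "lng": lng, "r": r}).fetchone()

print(nearest(457004, 91805))   # the 'closest' row
print(nearest(300000, 50000))   # None: nothing within the box
```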

How to add an index to such sql query?

Please tell me how to add an index to this sql query?
SELECT *
FROM table
WHERE (cities IS NULL) AND (position_id = '2') AND (is_pub = '1')
ORDER BY ordering asc
LIMIT 1
Field types:
cities = text
position_id = int(11)
is_pub = tinyint(1)
I try so:
ALTER TABLE table ADD FULLTEXT ( 'cities', 'position_id', 'is_pub' );
But I get an error: The used table type doesn't support FULLTEXT indexes
First, rewrite the query so you are not mixing types. That is, get rid of the single quotes:
SELECT *
FROM table
WHERE (cities IS NULL) AND (position_id = 2) AND (is_pub = 1)
ORDER BY ordering asc
LIMIT 1;
Then, the best index for this query is on table(position_id, is_pub, cities, ordering):
create index idx_table_4 on table(position_id, is_pub, cities(32), ordering);
The first three columns can be in any order in the index, so long as they are the first three.
You should change cities to a varchar() type. Is there a reason you want to use text for this?
You need to change the engine for your table to MyISAM.
possible duplicate of #1214 - The used table type doesn't support FULLTEXT indexes

Need Help Speeding up an Aggregate SQLite Query

I have a table defined like the following...
CREATE table actions (
id INTEGER PRIMARY KEY AUTO_INCREMENT,
end BOOLEAN,
type VARCHAR(15) NOT NULL,
subtype_a VARCHAR(15),
subtype_b VARCHAR(15)
);
I'm trying to query for the last end action of some type to happen on each unique (subtype_a, subtype_b) pair, similar to a GROUP BY (except SQLite doesn't say which row is guaranteed to be returned by a GROUP BY).
On an SQLite database of about 1MB, the query I have now can take upwards of two seconds, but I need to speed it up to take under a second (since this will be called frequently).
example query:
SELECT * FROM actions a_out
WHERE id =
(SELECT MAX(a_in.id) FROM actions a_in
WHERE a_out.subtype_a = a_in.subtype_a
AND a_out.subtype_b = a_in.subtype_b
AND a_in.status IS NOT NULL
AND a_in.type = "some_type");
If it helps, I know all the unique possibilities for a (subtype_a,subtype_b)
eg:
(a,1)
(a,2)
(b,3)
(b,4)
(b,5)
(b,6)
Beginning with version 3.7.11, SQLite guarantees which record is returned in a group:
Queries of the form: "SELECT max(x), y FROM table" returns the value of y on the same row that contains the maximum x value.
So greatest-n-per-group can be implemented in a much simpler way:
SELECT *, max(id)
FROM actions
WHERE type = 'some_type'
GROUP BY subtype_a, subtype_b
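A quick SQLite check of that guarantee, under an assumed minimal schema (trimmed to the columns that matter; payload stands in for the other fields):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE actions (
        id INTEGER PRIMARY KEY,
        type TEXT NOT NULL,
        subtype_a TEXT,
        subtype_b TEXT,
        payload TEXT
    );
    INSERT INTO actions (type, subtype_a, subtype_b, payload) VALUES
        ('some_type', 'a', '1', 'p1'),
        ('some_type', 'a', '1', 'p2'),   -- later row for (a,1): id 2 should win
        ('some_type', 'b', '3', 'p3'),
        ('other',     'a', '1', 'p4');   -- wrong type: ignored
""")

# Bare columns (payload) come from the row holding max(id), per SQLite >= 3.7.11.
rows = con.execute("""
    SELECT subtype_a, subtype_b, payload, max(id)
    FROM actions
    WHERE type = 'some_type'
    GROUP BY subtype_a, subtype_b
    ORDER BY subtype_a, subtype_b
""").fetchall()

print(rows)  # [('a', '1', 'p2', 2), ('b', '3', 'p3', 3)]
```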
Is this any faster?
select * from actions where id in (select max(id) from actions where type="some_type" group by subtype_a, subtype_b);
This is the greatest-n-per-group problem that comes up frequently on Stack Overflow.
Here's how I solve it:
SELECT a_out.* FROM actions a_out
LEFT OUTER JOIN actions a_in ON a_out.subtype_a = a_in.subtype_a
AND a_out.subtype_b = a_in.subtype_b
AND a_out.id < a_in.id
WHERE a_out.type = "some_type" AND a_in.id IS NULL
If you have an index on (type, subtype_a, subtype_b, id) this should run very fast.
See also my answers to similar SQL questions:
Fetch the row which has the Max value for a column
Retrieving the last record in each group
SQL join: selecting the last records in a one-to-many relationship
Or this brilliant article by Jan Kneschke: Groupwise Max.

mysql not in issue

I have a select statement that is trying to build a list of scripts as long as the user's role is not in the scripts.sans_role_priority field. This works great if there is only one entry in the field, but once I add more than one, the whole function quits working. I am sure I am overlooking something simple; I just need another set of eyes on it. Any help would be appreciated.
script:
SELECT *
FROM scripts
WHERE active = 1
AND homePage='Y'
AND (role_priority > 40 OR role_priority = 40)
AND (40 not in (sans_role_priority) )
ORDER BY seq ASC
data in scripts.sans_role_priority(varchar) = "30,40".
Additional testing adds this:
When I switch the values in the field to "40, 30" the select works. Continuing to debug...
Maybe you are looking for FIND_IN_SET().
SELECT *
FROM scripts
WHERE active = 1
AND homePage='Y'
AND (role_priority > 40 OR role_priority = 40)
AND NOT FIND_IN_SET('40', sans_role_priority)
ORDER BY seq ASC
Note that having "X,Y,Z" as VARCHAR values in some fields suggests that your DB schema could be improved by storing X, Y, and Z as separate values in a related table.
SELECT *
FROM scripts
WHERE active = 1
AND homePage='Y'
AND role_priority >= 40
AND NOT FIND_IN_SET(40,sans_role_priority)
ORDER BY seq ASC
See: http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_find-in-set
Note that CSV in databases is just about the worst antipattern you can find.
It should be avoided at all costs because:
You cannot use an index on a CSV field (at least not a mentally sane one);
Joins on CSV fields are a major PITA;
Selects on them are uber-slow;
They violate 1NF;
They waste storage.
Instead of using a CSV field, consider putting sans_role_priority in another table with a link back to scripts.
table script_sans_role_priority
-------------------------------
script_id integer foreign key references script(id)
srp integer
primary key (script_id, srp)
Then the renormalized select will be:
SELECT s.*
FROM scripts s
LEFT JOIN script_sans_role_priority srp
ON (s.id = srp.script_id AND srp.srp = 40)
WHERE s.active = 1
AND s.homePage='Y'
AND s.role_priority >= 40
AND srp.script_id IS NULL
ORDER BY s.seq ASC
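The renormalized design is easy to try out in SQLite (sample data invented for illustration; script 2 is the one that excludes role priority 40):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE scripts (
        id INTEGER PRIMARY KEY,
        active INTEGER, homePage TEXT, role_priority INTEGER, seq INTEGER
    );
    CREATE TABLE script_sans_role_priority (
        script_id INTEGER REFERENCES scripts(id),
        srp INTEGER,
        PRIMARY KEY (script_id, srp)
    );

    INSERT INTO scripts VALUES
        (1, 1, 'Y', 40, 1),
        (2, 1, 'Y', 50, 2),
        (3, 1, 'Y', 60, 3);
    -- script 2 excludes role priorities 30 and 40
    INSERT INTO script_sans_role_priority VALUES (2, 30), (2, 40);
""")

# Anti-join: keep only scripts with no exclusion row for priority 40.
rows = con.execute("""
    SELECT s.id
    FROM scripts s
    LEFT JOIN script_sans_role_priority srp
           ON s.id = srp.script_id AND srp.srp = 40
    WHERE s.active = 1
      AND s.homePage = 'Y'
      AND s.role_priority >= 40
      AND srp.script_id IS NULL
    ORDER BY s.seq ASC
""").fetchall()

print(rows)  # [(1,), (3,)] -- script 2 is filtered out
```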
SELECT *
FROM scripts
WHERE active = '1'
AND homePage='Y'
AND role_priority >= '40'
AND sans_role_priority <> '40'
ORDER BY seq ASC