I am trying to figure out the most efficient way of writing the query below. Right now it runs against a users table of 3k records, scheduleday of 12k records, and scheduleuser of 300k records.
The method I am using works, but it is not fast. It is plenty fast for 100 records and under, but that is not how I need it displayed. I know there must be a more efficient way of running this: if I take out the nested select, it runs in .00025 seconds; add the nested select back and we're pushing 9+ seconds.
All I am trying to do is get the most recent date a user was scheduled. The scheduleuser table only gives the scheduleid and dayid, which are then looked up in scheduleday to get the date. I can't use max(scheduleuser.rec) because the order the rows were entered may not match date order.
The result of this query would be:
Bob 4/6/2022
Ralph 4/7/2022
Please note this query works perfectly fine; I am just looking for ways to make it more efficient.
Percona Server MySQL 5.5
SELECT
(
    SELECT MAX(STR_TO_DATE(scheduleday.ddate, '%m/%d/%Y'))
    FROM scheduleuser su1
    LEFT JOIN scheduleday ON scheduleday.scheduleid=su1.scheduleid AND scheduleday.dayid=su1.dayid
    WHERE su1.idUser=users.idUser
) as lastsecheduledate, users.usersName
FROM users
users
idUser | usersName
1      | bob
2      | ralph

scheduleday
scheduleid | dayid | ddate
1          | 1     | 4/5/2022
1          | 2     | 4/6/2022
1          | 3     | 4/7/2022

scheduleuser (su1)
rec | idUser | dayid | scheduleid
1   | 1      | 2     | 1
1   | 2      | 3     | 1
1   | 1      | 1     | 1
As requested, full query
SELECT users.iduser, users.adminName, users.firstname, users.lastname, users.lastLogin, users.area, users.type, users.terminationdate, users.termreason, users.cellphone,
(SELECT MAX(STR_TO_DATE(scheduleday.ddate, '%m/%d/%Y')) FROM scheduleuser
 LEFT JOIN scheduleday ON scheduleday.scheduleid=scheduleuser.scheduleid AND scheduleday.dayid=scheduleuser.dayid
 WHERE scheduleuser.iduser=users.iduser
) as lastsecheduledate,
IFNULL(userrating.rating,'0.00') as userrating, IFNULL(location.area,'') as userarea, IFNULL(usertypes.name,'') as usertype, IFNULL(useropen.iduser,0) as useropen
FROM users
LEFT JOIN userrating ON userrating.iduser=users.iduser
LEFT JOIN location ON location.idarea=users.area
LEFT JOIN usertypes ON usertypes.idtype=users.type
LEFT JOIN useropen ON useropen.iduser=users.iduser
WHERE
users.type<>0 AND users.active=1
ORDER BY users.firstName
As requested, create tables
CREATE TABLE `users` (
`idUser` int(11) NOT NULL,
`usersName` varchar(255) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
ALTER TABLE `users`
ADD PRIMARY KEY (`idUser`);
ALTER TABLE `users`
MODIFY `idUser` int(11) NOT NULL AUTO_INCREMENT;
COMMIT;
CREATE TABLE `scheduleday` (
`rec` int(11) NOT NULL,
`scheduleid` int(11) NOT NULL,
`dayid` int(11) NOT NULL,
`ddate` varchar(255) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
ALTER TABLE `scheduleday`
ADD PRIMARY KEY (`rec`),
ADD KEY `dayid` (`dayid`),
ADD KEY `scheduleid` (`scheduleid`);
ALTER TABLE `scheduleday`
MODIFY `rec` int(11) NOT NULL AUTO_INCREMENT;
COMMIT;
CREATE TABLE `scheduleuser` (
`rec` int(11) NOT NULL,
`idUser` int(11) NOT NULL,
`dayid` int(11) NOT NULL,
`scheduleid` int(11) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
ALTER TABLE `scheduleuser`
ADD PRIMARY KEY (`rec`),
ADD KEY `idUser` (`idUser`),
ADD KEY `dayid` (`dayid`),
ADD KEY `scheduleid` (`scheduleid`);
ALTER TABLE `scheduleuser`
MODIFY `rec` int(11) NOT NULL AUTO_INCREMENT;
COMMIT;
I think my recommendation would be to do that subquery once with a GROUP BY and join it. Something like
SELECT users.iduser, users.adminName, users.firstname, users.lastname, users.lastLogin, users.area, users.type, users.terminationdate, users.termreason, users.cellphone,
lsd.lastscheduledate AS lastsecheduledate,
IFNULL(userrating.rating,'0.00') as userrating, IFNULL(location.area,'') as userarea, IFNULL(usertypes.name,'') as usertype, IFNULL(useropen.iduser,0) as useropen
FROM users
LEFT JOIN (SELECT iduser, MAX(STR_TO_DATE(scheduleday.ddate, '%m/%d/%Y')) lastscheduledate FROM scheduleuser LEFT JOIN scheduleday ON scheduleday.scheduleid=scheduleuser.scheduleid AND scheduleday.dayid=scheduleuser.dayid
GROUP BY iduser
) lsd
ON lsd.iduser=users.iduser
LEFT JOIN userrating ON userrating.iduser=users.iduser
LEFT JOIN location ON location.idarea=users.area
LEFT JOIN usertypes ON usertypes.idtype=users.type
LEFT JOIN useropen ON useropen.iduser=users.iduser
WHERE
users.type<>0 AND users.active=1
ORDER BY users.firstName
This will likely be more efficient, since the DB can do the subquery once for all users, probably using your scheduleuser.iduser index.
If you are using something like above and it's still not performant, I might suggest experimenting with:
ALTER TABLE scheduleuser ADD INDEX (scheduleid, dayid)
ALTER TABLE scheduleday ADD INDEX (scheduleid, dayid)
This would ensure it can do the entire join in the subquery with the indexes. Of course, there are tradeoffs to adding more indexes, so depending on your data profile it might not be worth it (and it might not actually improve anything).
If you are using your original query, I might suggest experimenting with:
ALTER TABLE scheduleuser ADD INDEX (iduser,scheduleid, dayid)
ALTER TABLE scheduleday ADD INDEX (scheduleid, dayid)
This would allow it to do the subquery (both the JOIN and the WHERE) without touching the actual scheduleuser table at all. Again, I say "experiment" since there are tradeoffs and this might not actually improve things much.
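One way to check whether the covering index is actually being used (a quick sketch; run it against the original correlated form and look for the new index under key and "Using index" under Extra for the scheduleuser row):
EXPLAIN SELECT users.usersName,
       (SELECT MAX(STR_TO_DATE(scheduleday.ddate, '%m/%d/%Y'))
        FROM scheduleuser su1
        LEFT JOIN scheduleday ON scheduleday.scheduleid=su1.scheduleid AND scheduleday.dayid=su1.dayid
        WHERE su1.idUser=users.idUser) AS lastsecheduledate
FROM users;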
When you nest a query in the SELECT list as you're doing, that query gets evaluated for each record in the result set, because its WHERE clause uses a column from outside the query. You really just want to calculate the result set of max dates once and join your users to it after it is done:
select usersName, last_scheduled
from users
left join (select su.iduser, max(sd.ddate) as last_scheduled
from scheduleuser as su left join scheduleday as sd on su.dayid = sd.dayid
and su.scheduleid = sd.scheduleid
group by su.iduser) recents on users.iduser = recents.iduser
I've obviously left your other columns off and just given you the name and date, but this is the general principle.
Bug:
MAX(sd.ddate)
(as in the previous answer) compares the m/d/Y strings character by character, so '4/7/2022' beats '1/5/2023'. Keep the conversion inside the aggregate:
MAX(STR_TO_DATE(scheduleday.ddate, '%m/%d/%Y'))
or, better yet, store ddate as a real DATE column. Otherwise you will be in for a rude surprise next January.
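A quick way to see the difference (a throwaway sketch with made-up literals, not the real table):
SELECT MAX(ddate)                          AS string_max,  -- '9/1/2022': wins the character-by-character comparison
       MAX(STR_TO_DATE(ddate, '%m/%d/%Y')) AS date_max     -- '2023-01-05': the actual latest date
FROM (SELECT '9/1/2022' AS ddate UNION ALL SELECT '1/5/2023') AS t;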
Possible better indexes. Switch from MyISAM to InnoDB. The following indexes assume InnoDB; they may not work as well in MyISAM.
users: INDEX(active, type)
userrating: INDEX(iduser, rating)
location: INDEX(idarea, area)
usertypes: INDEX(idtype, name)
useropen: INDEX(iduser)
scheduleday: INDEX(scheduleid, dayid, ddate)
scheduleuser: INDEX(iduser, scheduleid, dayid)
users: INDEX(iduser)
When adding a composite index, DROP index(es) with the same leading columns.
That is, when you have both INDEX(a) and INDEX(a,b), toss the former.
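For reference, adding a couple of those looks roughly like this (a sketch; the index names are arbitrary, and the single-column idUser key becomes redundant once the composite exists):
ALTER TABLE scheduleday ADD INDEX sched_day_date (scheduleid, dayid, ddate);
ALTER TABLE scheduleuser ADD INDEX user_sched (iduser, scheduleid, dayid), DROP INDEX idUser;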
I think I've optimized what I could for the following table structure:
CREATE TABLE `sal_forwarding` (
`sid` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
`f_shop` INT(11) NOT NULL,
`f_offer` INT(11) DEFAULT NULL,
...
`f_affiliateId` TINYINT(3) UNSIGNED NOT NULL,
`forwardDate` DATE NOT NULL,
PRIMARY KEY (`sid`),
KEY `f_partner` (`f_partner`,`forwardDate`),
KEY `forwardDate` (`forwardDate`,`cid`),
KEY `forwardDate_2` (`forwardDate`,`f_shop`),
KEY `forwardDate_3` (`forwardDate`,`f_shop`,`f_partner`),
KEY `forwardDate_4` (`forwardDate`,`f_partner`,`cid`),
KEY `forwardDate_5` (`forwardDate`,`f_affiliateId`),
KEY `forwardDate_6` (`forwardDate`,`f_shop`,`sid`),
KEY `forwardDate_7` (`forwardDate`,`f_shop`,`cid`),
KEY `forwardDate_8` (`forwardDate`,`f_affiliateId`,`cid`)
) ENGINE=INNODB AUTO_INCREMENT=10946560 DEFAULT CHARSET=latin1
This is the explain Statement:
id: 1   select_type: SIMPLE   table: sal_forwarding   type: range
possible_keys: forwardDate, forwardDate_2, forwardDate_3, forwardDate_4, forwardDate_5, forwardDate_6, forwardDate_7, forwardDate_8
key: forwardDate_7   key_len: 3   ref: NULL   rows: 1221784
Extra: Using where; Using index; Using filesort
The following query needs 23 seconds to read 2,300 rows:
SELECT COUNT(sid),f_shop, COUNT(DISTINCT(cid))
FROM sal_forwarding
WHERE forwardDate BETWEEN "2011-01-01" AND "2011-11-01"
GROUP BY f_shop
What can I do to improve the performance?
Thank you very much.
A slight modification to what you had: use COUNT(*) instead of an actual field, and for the DISTINCT you don't need parentheses around the column. The optimizer may also be getting confused by all the indexes you have, so remove all other indexes leading with forwardDate, keeping only one based on (forwardDate, f_shop, cid) (your current forwardDate_7 index), as sketched after the query below.
SELECT
COUNT(*),
f_shop,
COUNT(DISTINCT cid )
FROM
sal_forwarding
WHERE
forwardDate BETWEEN "2011-01-01" AND "2011-11-01"
GROUP BY
f_shop
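If you do prune the forwardDate indexes as suggested above, the cleanup would be something like this (a sketch; double-check the names against SHOW INDEX FROM sal_forwarding before dropping anything):
ALTER TABLE sal_forwarding
    DROP INDEX forwardDate,
    DROP INDEX forwardDate_2,
    DROP INDEX forwardDate_3,
    DROP INDEX forwardDate_4,
    DROP INDEX forwardDate_5,
    DROP INDEX forwardDate_6,
    DROP INDEX forwardDate_8;  -- keeps forwardDate_7 (forwardDate, f_shop, cid)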
Then, for grins, and since nothing else appears to be working for you, try putting in a pre-subquery on the records, then sum from that, so it's not relying on any other index pages based on your near 11 million records (implied per Auto-increment value)...
SELECT
f_shop,
sum( PreQuery.Presum) totalCnt,
COUNT(*) dist_cid
FROM
( select f_shop, cid, count(*) presum
from sal_forwarding
WHERE forwardDate BETWEEN "2011-01-01" AND "2011-11-01"
group by f_shop, cid ) PreQuery
GROUP BY
f_shop
Since the inner pre-query is doing a simple count of records grouped by f_shop and cid (which the index can optimize), each row it returns already represents one distinct cid per shop. The outer query then just counts those rows to get the distinct cid figure and SUM()s the inner presum column for the total count. Again, just another option to try and turn the tables; hope it works for you.
I don't think the (forwardDate, f_shop, cid) index is good for this query; it is not any better than a simple (forwardDate) index, because the range condition on forwardDate prevents the columns after it from being used for the grouping.
You may try a (f_shop, cid, forwardDate) index.
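A sketch of the corresponding DDL (the index name is just a placeholder):
ALTER TABLE sal_forwarding ADD INDEX shop_cid_date (f_shop, cid, forwardDate);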
I wanted to simulate a large amount of data in a database and test how my query would perform under such conditions. I was not surprised when the query turned out to be slow. So here I am, seeking advice on how I could better index my tables and improve my queries.
Before I post the tables' SQL and the query I use, let me explain what is what. I have a user table populated with 100,000 records. Most of the columns in it are enum type, like hair color, looking_for, etc. The first query I have is generated when a search is done. It consists of a WHERE clause in which some or all column values are searched for, and only ids are retrieved, limited to 20.
Then I have 3 more tables that hold about 50-1000 records per user each, so the numbers could really grow. These tables hold information on who visited whose profile, who marked whom as a favorite, who blocked whom, plus a messaging table. My goal is to retrieve 20 records that match the search criteria, but also to determine whether I (the user who's browsing) have:
blocked them
favorited them
was favorited by them
have unread messages from them
have sent or received any messages from them
For this I tried using both joins and subqueries, but the second query, which retrieves the users and the data listed above, is still slow. I think I need better indexes and possibly better queries. Here is what I have right now: table definitions first, and the 2 queries at the end. The first does the search and determines the IDs; the second uses the ids from the first query to retrieve the data. I hope you guys can help me create better indexes and optimize the query.
CREATE TABLE user (id BIGINT AUTO_INCREMENT, dname VARCHAR(255) NOT NULL, email VARCHAR(255) NOT NULL UNIQUE, email_code VARCHAR(255), email_confirmed TINYINT(1) DEFAULT '0', password VARCHAR(255) NOT NULL, gender ENUM('male', 'female'), description TEXT, dob DATE, height MEDIUMINT, looks ENUM('thin', 'average', 'athletic', 'heavy'), looking_for ENUM('marriage', 'dating', 'friends'), looking_for_age1 BIGINT, looking_for_age2 BIGINT, color_hair ENUM('black', 'brown', 'blond', 'red'), color_eyes ENUM('black', 'brown', 'blue', 'green', 'grey'), marital_status ENUM('single', 'married', 'divorced', 'widowed'), smokes ENUM('no', 'yes', 'sometimes'), drinks ENUM('no', 'yes', 'sometimes'), has_children ENUM('no', 'yes'), wants_children ENUM('no', 'yes'), education ENUM('school', 'college', 'university', 'masters', 'phd'), occupation ENUM('no', 'yes'), country_id BIGINT, city_id BIGINT, lastlogin_at DATETIME, deleted_at DATETIME, created_at DATETIME NOT NULL, updated_at DATETIME NOT NULL, INDEX country_id_idx (country_id), INDEX city_id_idx (city_id), INDEX image_id_idx (image_id), PRIMARY KEY(id)) DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci ENGINE = INNODB;
CREATE TABLE block (id BIGINT AUTO_INCREMENT, blocker_id BIGINT, blocked_id BIGINT, created_at DATETIME NOT NULL, updated_at DATETIME NOT NULL, INDEX blocker_id_idx (blocker_id), INDEX blocked_id_idx (blocked_id), PRIMARY KEY(id)) DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci ENGINE = INNODB;
CREATE TABLE city (id BIGINT AUTO_INCREMENT, name_eng VARCHAR(30), name_geo VARCHAR(30), name_geo_shi VARCHAR(30), name_geo_is VARCHAR(30), country_id BIGINT NOT NULL, active TINYINT(1) DEFAULT '0', INDEX country_id_idx (country_id), PRIMARY KEY(id)) DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci ENGINE = INNODB;
CREATE TABLE country (id BIGINT AUTO_INCREMENT, code VARCHAR(2), name_eng VARCHAR(30), name_geo VARCHAR(30), name_geo_shi VARCHAR(30), name_geo_is VARCHAR(30), active TINYINT(1) DEFAULT '1', PRIMARY KEY(id)) DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci ENGINE = INNODB;
CREATE TABLE favorite (id BIGINT AUTO_INCREMENT, favoriter_id BIGINT, favorited_id BIGINT, created_at DATETIME NOT NULL, updated_at DATETIME NOT NULL, INDEX favoriter_id_idx (favoriter_id), INDEX favorited_id_idx (favorited_id), PRIMARY KEY(id)) DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci ENGINE = INNODB;
CREATE TABLE message (id BIGINT AUTO_INCREMENT, body TEXT, sender_id BIGINT, receiver_id BIGINT, read_at DATETIME, created_at DATETIME NOT NULL, updated_at DATETIME NOT NULL, INDEX sender_id_idx (sender_id), INDEX receiver_id_idx (receiver_id), PRIMARY KEY(id)) DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci ENGINE = INNODB;
CREATE TABLE visitor (id BIGINT AUTO_INCREMENT, visitor_id BIGINT, visited_id BIGINT, created_at DATETIME NOT NULL, updated_at DATETIME NOT NULL, INDEX visitor_id_idx (visitor_id), INDEX visited_id_idx (visited_id), PRIMARY KEY(id)) DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci ENGINE = INNODB;
SELECT s.id AS s__id FROM user s WHERE (s.gender = 'female' AND s.marital_status = 'single' AND s.smokes = 'no' AND s.deleted_at IS NULL) LIMIT 20
SELECT s.id AS s__id, s.dname AS s__dname, s.gender AS s__gender, s.height AS s__height, s.dob AS s__dob, s3.id AS s3__id, s3.code AS s3__code, s3.name_geo AS s3__name_geo, s4.id AS s4__id, s4.name_geo AS s4__name_geo, s5.id AS s5__id, s6.id AS s6__id, s7.id AS s7__id, s8.id AS s8__id, s9.id AS s9__id FROM user s LEFT JOIN country s3 ON s.country_id = s3.id LEFT JOIN city s4 ON s.city_id = s4.id LEFT JOIN block s5 ON ((s.id = s5.blocked_id AND s5.blocker_id = '1')) LEFT JOIN favorite s6 ON ((s.id = s6.favorited_id AND s6.favoriter_id = '1')) LEFT JOIN favorite s7 ON ((s.id = s7.favoriter_id AND s7.favorited_id = '1')) LEFT JOIN message s8 ON ((s.id = s8.sender_id AND s8.receiver_id = '1' AND s8.read_at IS NULL)) LEFT JOIN message s9 ON (((s.id = s9.sender_id AND s9.receiver_id = '1') OR (s.id = s9.receiver_id AND s9.sender_id = '1'))) WHERE (s.id IN ('22', '36', '53', '105', '152', '156', '169', '182', '186', '192', '201', '215', '252', '287', '288', '321', '330', '351', '366', '399')) GROUP BY s.id ORDER BY s.id
Here are the results of EXPLAIN of the 2 queries above:
First:
1 SIMPLE s ALL NULL NULL NULL NULL 100420 Using Where
Second:
1 SIMPLE s range PRIMARY PRIMARY 8 NULL 20 Using where; Using temporary; Using filesort
1 SIMPLE s2 eq_ref PRIMARY PRIMARY 8 sagule.s.image_id 1 Using index
1 SIMPLE s3 eq_ref PRIMARY PRIMARY 8 sagule.s.country_id 1
1 SIMPLE s4 eq_ref PRIMARY PRIMARY 8 sagule.s.city_id 1
1 SIMPLE s5 ref blocker_id_idx,blocked_id_idx blocked_id_idx 9 sagule.s.id 5
1 SIMPLE s6 ref favoriter_id_idx,favorited_id_idx favorited_id_idx 9 sagule.s.id 6
1 SIMPLE s7 ref favoriter_id_idx,favorited_id_idx favoriter_id_idx 9 sagule.s.id 6
1 SIMPLE s8 ref sender_id_idx,receiver_id_idx sender_id_idx 9 sagule.s.id 7
1 SIMPLE s9 index_merge sender_id_idx,receiver_id_idx receiver_id_idx,sender_id_idx 9,9 NULL 66 Using union(receiver_id_idx,sender_id_idx); Using where
I'm an MSSQL guy and haven't used MySQL, but the concepts should be the same.
Firstly, can you remove the GROUP BY and ORDER BY and comment out all tables except the first one, "user"? Also comment out any columns of the removed tables, as I have done below.
SELECT s.id AS s__id,
s.dname AS s__dname,
s.gender AS s__gender,
s.height AS s__height,
s.dob AS s__dob
-- s3.id AS s3__id,
-- s3.code AS s3__code,
-- s3.name_geo AS s3__name_geo,
-- s4.id AS s4__id,
-- s4.name_geo AS s4__name_geo,
-- s5.id AS s5__id,
-- s6.id AS s6__id,
-- s7.id AS s7__id,
-- s8.id AS s8__id,
-- s9.id AS s9__id
FROM user s -- LEFT JOIN
-- country s3 ON s.country_id = s3.id LEFT JOIN
-- city s4 ON s.city_id = s4.id LEFT JOIN
-- block s5 ON ((s.id = s5.blocked_id AND s5.blocker_id = '1')) LEFT JOIN
-- favorite s6 ON ((s.id = s6.favorited_id AND s6.favoriter_id = '1')) LEFT JOIN
-- favorite s7 ON ((s.id = s7.favoriter_id AND s7.favorited_id = '1')) LEFT JOIN
-- message s8 ON ((s.id = s8.sender_id AND s8.receiver_id = '1' AND s8.read_at IS NULL)) LEFT JOIN
-- message s9 ON (((s.id = s9.sender_id AND s9.receiver_id = '1') OR (s.id = s9.receiver_id AND s9.sender_id = '1')))
WHERE (s.id IN ('22', '36', '53', '105', '152', '156', '169', '182', '186', '192', '201', '215', '252', '287', '288', '321', '330', '351', '366', '399'))
Run the query and record the time. Then add one table and its columns back in at a time and run it until you find which one causes it to slow significantly.
SELECT s.id AS s__id,
s.dname AS s__dname,
s.gender AS s__gender,
s.height AS s__height,
s.dob AS s__dob,
s3.id AS s3__id,
s3.code AS s3__code,
s3.name_geo AS s3__name_geo
-- s4.id AS s4__id,
-- s4.name_geo AS s4__name_geo,
-- s5.id AS s5__id,
-- s6.id AS s6__id,
-- s7.id AS s7__id,
-- s8.id AS s8__id,
-- s9.id AS s9__id
FROM user s LEFT JOIN
country s3 ON s.country_id = s3.id -- LEFT JOIN
-- city s4 ON s.city_id = s4.id LEFT JOIN
-- block s5 ON ((s.id = s5.blocked_id AND s5.blocker_id = '1')) LEFT JOIN
-- favorite s6 ON ((s.id = s6.favorited_id AND s6.favoriter_id = '1')) LEFT JOIN
-- favorite s7 ON ((s.id = s7.favoriter_id AND s7.favorited_id = '1')) LEFT JOIN
-- message s8 ON ((s.id = s8.sender_id AND s8.receiver_id = '1' AND s8.read_at IS NULL)) LEFT JOIN
-- message s9 ON (((s.id = s9.sender_id AND s9.receiver_id = '1') OR (s.id = s9.receiver_id AND s9.sender_id = '1')))
WHERE (s.id IN ('22', '36', '53', '105', '152', '156', '169', '182', '186', '192', '201', '215', '252', '287', '288', '321', '330', '351', '366', '399'))
My guess is that it will be the block, favorite, and message joins that are giving you the performance hit (the one with the most rows will be the biggest hit).
For the block table, can you remove one of the indexes and change the other to something along the lines of the following (I am not sure of the exact syntax, but you'll get the point):
INDEX blocker_id_idx (blocker_id,blocked_id),
and try it with the column order swapped around to find which order is best for your query:
INDEX blocker_id_idx (blocked_id,blocker_id),
For the favorite table, change the indexes to
INDEX favoriter_id_idx (favoriter_id,favorited_id),
INDEX favorited_id_idx (favorited_id,favoriter_id),
Again, try it with the columns swapped around to find which gives better performance.
Do the same for the message indexes.
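In MySQL the syntax for swapping the indexes around looks roughly like this (a sketch for the block table only; favorite and message follow the same pattern, and the new index name is just a placeholder):
ALTER TABLE block
    DROP INDEX blocker_id_idx,
    DROP INDEX blocked_id_idx,
    ADD INDEX blocked_blocker_idx (blocked_id, blocker_id);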
Do that and let me know if things improved. There are a few other things that can be done to improve it further. - EDIT: It seems I lied about the few other things; what I had intended would not have made any difference. But I can speed up your first query, which is below.
EDIT: This is for your first SELECT query.
This one is a bit long, but I wanted to show you how indexes work so you can make your own.
Let's say the table contains 100,000 rows.
When you select from it, this is the general process the engine will take:
Are there any indexes that cover, or mostly cover, the columns I need? (In your case, no, there aren't.) So use the primary index and scan through every row in the table to check for a match.
Every row in the table will need to be read from disk to find which rows match your criteria. So to return the approx. 10,000 rows (this is a guess) that match your data, the database engine has read all 100,000 rows.
You do have a LIMIT 20 in your query, so it will limit the amount of rows the engine will read from disk.
Example:
read row 1: is match so add to result
read row 2: no match - skip
read row 3: no match - skip
read row 4: is match so add to result
stop after 20 rows identified
You potentially read about 5000 rows from disk to return 20.
We need to create an index that will help us read as few records as possible from the table/disk, but still get the rows we are after. So here goes.
Your query uses 4 filters to get to the data.
s.gender = 'female' AND
s.marital_status = 'single' AND
s.smokes = 'no' AND
s.deleted_at IS NULL
What we need to do now is identify which filter by itself will return the least amount of rows. I can't tell, as I don't have any data, but this is what I would guess for your table.
The gender column supports 2 values, and it would be fair to estimate that half of the records in your database are male and the other half female, so that filter will return approx. 50,000 rows.
Now for marital status: it supports four values, so if we say the data has an equal spread, we would get roughly 25,000 rows back. Of course, it depends on the actual data, and I would say there are not too many widowed in the data, so a better estimate may be a 30% share for each of the other three. So let's say 30,000 records marked as single.
Now for the smokes column. I have read that here in Australia about 10% of people smoke, which is a fairly low number compared to other countries. So let's say 25% either smoke or smoke sometimes. That leaves us with approx. 75,000 non-smokers.
Now for the last column, deleted. A fair guess on my part, but let's say 5% are marked as deleted. That leaves us with approx. 95,000 rows.
So in summary (remember, this is all pure guesswork on my part; your data may be different):
Gender 50,000 rows or 50%
Marital status 30,000 rows or 30%
Smokes 75,000 rows or 75%
Deleted 95,000 rows or 95%
So if we create an index with the four columns, putting the one that returns the least amount of rows first, we would get the following:
INDEX index01_idx (marital_status,gender,smokes,deleted_at),
Now this is what will happen when we run the SELECT:
The server will find an index that covers all the columns in the WHERE clause.
It will narrow down the result set to 30,000 "single" records.
Of those 30,000, 50% will be female; that leaves 15,000 records.
Of those 15,000, 75% will not smoke; that leaves 11,250 records.
Of those 11,250, 95% will not be deleted.
That leaves us with just over 10,000 records out of 100,000 total that we have identified as the records we want, but not yet read from disk. You also have a LIMIT 20 in the query, so the database engine just needs to read the first 20 of the 10,000 and return the result. It's super quick, the hard disk will love you, and the scary DBA will even mumble and grunt with approval.
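For completeness, the DDL for that index against the user table from the question would be along these lines (a sketch; the name index01_idx is arbitrary):
ALTER TABLE user ADD INDEX index01_idx (marital_status, gender, smokes, deleted_at);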
In your second SELECT query, you can remove the GROUP BY clause because you aren't using any Aggregate functions (count, min, max...) in your SELECT clause.
I doubt this will help much improving performance, though.
In any case, I recommend to watch the first half of this talk "A Look into a MySQL DBA's Toolchest".
(The first two thirds of the video are about free open-source admin-tools for mysql on Unix, the last third or so is about replication)
Video A Look into a MySQL DBA's Toolchest
From the same talk: The Guide To Understanding mysqlreport
Without some data to test with, it is not so easy to give good advice.
Creating an index on fields that are searched frequently can make your query faster, but with each index your inserts and updates can get slower. You have to think about the tradeoff: index the columns that get searched frequently, but test the new index on the data so you can see if the query actually runs faster.
I don't know which tools you are using, but in MySQL Workbench there is a command "Explain Current Statement" under the "Query" menu. There you can see which actions were done by MySQL and which keys were used. Your query shows "null", which means no key was used and MySQL had to run through the whole data set, comparing each row with the search term.
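You can get the same information without Workbench by prefixing the statement with EXPLAIN; for example, for the first search query from the question, check what shows up in the key and rows columns once you have added an index:
EXPLAIN SELECT s.id AS s__id
FROM user s
WHERE s.gender = 'female' AND s.marital_status = 'single' AND s.smokes = 'no' AND s.deleted_at IS NULL
LIMIT 20;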
Hope this helps a bit.