Proper way for setting indexes in query - mysql

I have got 2 tables.
first - table t_games (alias g)
column type
g_id mediumint(8)
t_id_1 smallint(5)
t_id_2 smallint(5)
g_team_1 varchar(50)
g_team_2 varchar(50)
g_date datetime
g_live tinyint(3)
The primary index is on the g_id field, and there is an additional composite index on (t_id_1, t_id_2, g_date, g_live).
second - table t_teams (aliases: t1 and t2)
column type
t_id smallint(5)
t_gw_name varchar(50)
gw_cid tinyint(3)
Primary index is set on t_id.
Relation between the tables (updated):
There are two teams in each game. The t_teams table holds the team names. In the t_games table I keep IDs referencing t_teams, so I can retrieve the name of each team taking part in a game. So to retrieve a game ID with the team names:
SELECT g.g_id, t1.t_gw_name, t2.t_gw_name FROM t_games g
JOIN t_teams t1 ON (g.t_id_1 = t1.t_id)
JOIN t_teams t2 ON (g.t_id_2 = t2.t_id)
My SQL query:
SELECT g_id, t_id_1, t_id_2, g_team_1, g_team_2, g_date, g_live, t1.t_gw_name AS t_gw_name_1, t1.gw_cid AS gw_cid_1, t2.t_gw_name AS t_gw_name_2, t2.gw_cid AS gw_cid_2
FROM t_games g
JOIN t_teams t1 ON (t_id_1 = t1.t_id) JOIN t_teams t2 ON (t_id_2 = t2.t_id)
WHERE g.g_date < "2013-07-24 20:00:00" AND g.g_live < 2
And after explain I get:
id  select_type  table  type    possible_keys  key      key_len  ref     rows  Extra
1   SIMPLE       g      ALL     t_id_1         NULL     NULL     NULL    16    Using where
1   SIMPLE       t1     eq_ref  PRIMARY        PRIMARY  2        t_id_1  1
1   SIMPLE       t2     eq_ref  PRIMARY        PRIMARY  2        t_id_2  1
I tried many combinations of indexes on the table, but I can't get rid of the ALL scan.

In your case (for the query you've shown) you only need an index that covers the single column g_date.
You see ALL because:
There are only 16 rows in the table (?)
You're selecting more than ~30% of the rows in the table
In both cases it's cheaper to scan the whole table than to use an index.
So to check that g_date index works:
Fill the t_games table with something like 1000 rows
Perform a query that would return about 10 rows from t_games table
PS:
a composite index (g_date, g_live) won't help, because you have a range comparison on both columns
a single-column index on g_live won't be very effective, because the cardinality of that column is low
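To make the suggestion concrete, here is a minimal sketch (the index name is arbitrary):
-- single-column index on g_date, as recommended above
ALTER TABLE t_games ADD INDEX idx_g_date (g_date);
-- re-check the plan once the table has ~1000 rows and the
-- predicate matches only a small fraction of them
EXPLAIN
SELECT g_id, t_id_1, t_id_2, g_date, g_live
FROM t_games
WHERE g_date < '2013-07-24 20:00:00' AND g_live < 2;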

Related

Selecting Max record in nested Join more efficiently

I am trying to figure out the most efficient way of writing the query below. Right now it runs against a users table of 3k records, a scheduleday table of 12k records, and a scheduleuser table of 300k records.
The method I am using works, but it is not fast. It is plenty fast with 100 records and under, but not with the volume I need displayed. I know there must be a more efficient way of running this: if I take out the nested select, it runs in .00025 seconds. Add the nested select, and we're pushing 9+ seconds.
All I am trying to do is get the most recent date a user was scheduled. The scheduleuser table only gives the scheduleid and dayid. These are then looked up in scheduleday to get the date. I can't use max(scheduleuser.rec) because the order entered may not be in date order.
The result of this query would be:
Bob 4/6/2022
Ralph 4/7/2022
Please note this query works perfectly fine, I am looking for ways to make it more efficient.
Percona Server Mysql 5.5
SELECT
(
SELECT MAX(STR_TO_DATE(scheduleday.ddate, '%m/%d/%Y')) FROM scheduleuser su1
LEFT JOIN scheduleday ON scheduleday.scheduleid=su1.scheduleid AND scheduleday.dayid=su1.dayid WHERE su1.idUser=users.idUser
)
as lastsecheduledate, users.usersName
FROM users
users
idUser  usersName
1       bob
2       ralph
scheduleday
scheduleid  dayid  ddate
1           1      4/5/2022
1           2      4/6/2022
1           3      4/7/2022
scheduleuser (su1)
rec  idUser  dayid  scheduleid
1    1       2      1
1    2       3      1
1    1       1      1
As requested, full query
SELECT users.iduser, users.adminName, users.firstname, users.lastname, users.lastLogin, users.area, users.type, users.terminationdate, users.termreason, users.cellphone,
(SELECT MAX(STR_TO_DATE(scheduleday.ddate, '%m/%d/%Y')) FROM scheduleuser
LEFT JOIN scheduleday ON scheduleday.scheduleid=scheduleuser.scheduleid AND scheduleday.dayid=scheduleuser.dayid
WHERE scheduleuser.iduser=users.iduser
) as lastsecheduledate,
IFNULL(userrating.rating,'0.00') as userrating, IFNULL(location.area,'') as userarea, IFNULL(usertypes.name,'') as usertype, IFNULL(useropen.iduser,0) as useropen
FROM users
LEFT JOIN userrating ON userrating.iduser=users.iduser
LEFT JOIN location ON location.idarea=users.area
LEFT JOIN usertypes ON usertypes.idtype=users.type
LEFT JOIN useropen ON useropen.iduser=users.iduser
WHERE
users.type<>0 AND users.active=1
ORDER BY users.firstName
As requested, create tables
CREATE TABLE `users` (
`idUser` int(11) NOT NULL,
`usersName` varchar(255) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
ALTER TABLE `users`
ADD PRIMARY KEY (`idUser`);
ALTER TABLE `users`
MODIFY `idUser` int(11) NOT NULL AUTO_INCREMENT;
COMMIT;
CREATE TABLE `scheduleday` (
`rec` int(11) NOT NULL,
`scheduleid` int(11) NOT NULL,
`dayid` int(11) NOT NULL,
`ddate` varchar(255) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
ALTER TABLE `scheduleday`
ADD PRIMARY KEY (`rec`),
ADD KEY `dayid` (`dayid`),
ADD KEY `scheduleid` (`scheduleid`);
ALTER TABLE `scheduleday`
MODIFY `rec` int(11) NOT NULL AUTO_INCREMENT;
COMMIT;
CREATE TABLE `scheduleuser` (
`rec` int(11) NOT NULL,
`idUser` int(11) NOT NULL,
`dayid` int(11) NOT NULL,
`scheduleid` int(11) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
ALTER TABLE `scheduleuser`
ADD PRIMARY KEY (`rec`),
ADD KEY `idUser` (`idUser`),
ADD KEY `dayid` (`dayid`),
ADD KEY `scheduleid` (`scheduleid`);
ALTER TABLE `scheduleuser`
MODIFY `rec` int(11) NOT NULL AUTO_INCREMENT;
COMMIT;
I think my recommendation would be to do that subquery once with a GROUP BY and join it. Something like
SELECT users.iduser, users.adminName, users.firstname, users.lastname, users.lastLogin, users.area, users.type, users.terminationdate, users.termreason, users.cellphone,
lsd.lastscheduledate,
IFNULL(userrating.rating,'0.00') as userrating, IFNULL(location.area,'') as userarea, IFNULL(usertypes.name,'') as usertype, IFNULL(useropen.iduser,0) as useropen
FROM users
LEFT JOIN (SELECT iduser, MAX(STR_TO_DATE(scheduleday.ddate, '%m/%d/%Y')) lastscheduledate FROM scheduleuser LEFT JOIN scheduleday ON scheduleday.scheduleid=scheduleuser.scheduleid AND scheduleday.dayid=scheduleuser.dayid
GROUP BY iduser
) lsd
ON lsd.iduser=users.iduser
LEFT JOIN userrating ON userrating.iduser=users.iduser
LEFT JOIN location ON location.idarea=users.area
LEFT JOIN usertypes ON usertypes.idtype=users.type
LEFT JOIN useropen ON useropen.iduser=users.iduser
WHERE
users.type<>0 AND users.active=1
ORDER BY users.firstName
This will likely be more efficient since the DB can do the query once for all users, likely using your scheduleuser.iduser index.
If you are using something like above and it's still not performant, I might suggest experimenting with:
ALTER TABLE scheduleuser ADD INDEX (scheduleid, dayid)
ALTER TABLE scheduleday ADD INDEX (scheduleid, dayid)
This would ensure it can do the entire join in the subquery with the indexes. Of course, there are tradeoffs to adding more indexes, so depending on your data profile it might not be worth it (and it might not actually improve anything).
If you are using your original query, I might suggest experimenting with:
ALTER TABLE scheduleuser ADD INDEX (iduser,scheduleid, dayid)
ALTER TABLE scheduleday ADD INDEX (scheduleid, dayid)
This would allow it to do the subquery (both the JOIN and the WHERE) without touching the actual scheduleuser table at all. Again, I say "experiment" since there are tradeoffs and this might not actually improve things much.
When you nest a query in the SELECT as you're doing, that query gets evaluated for each record in the result set, because its WHERE clause uses a column from outside the query. You really just want to calculate the result set of max dates once and join your users to it after it is done:
select usersName, last_scheduled
from users
left join (select su.iduser, max(sd.ddate) as last_scheduled
from scheduleuser as su left join scheduleday as sd on su.dayid = sd.dayid
and su.scheduleid = sd.scheduleid
group by su.iduser) recents on users.iduser = recents.iduser
I've obviously left your other columns off and just given you the name and date, but this is the general principle.
Bug:
max(sd.ddate) in the query above takes the MAX of the raw strings, not of dates.
Keep the conversion inside the aggregate instead:
MAX(STR_TO_DATE(sd.ddate, '%m/%d/%Y'))
Else you will be in for a rude surprise next January, when '1/5/2023' compares less than '4/7/2022' as a string.
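A quick way to see the trap, using hypothetical literal dates:
-- lexical MAX picks the 'largest' string, not the latest date
SELECT MAX(d)
FROM (SELECT '4/7/2022' AS d UNION ALL SELECT '1/5/2023') t;  -- returns '4/7/2022'
SELECT MAX(STR_TO_DATE(d, '%m/%d/%Y'))
FROM (SELECT '4/7/2022' AS d UNION ALL SELECT '1/5/2023') t;  -- returns 2023-01-05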
Possible better indexes. Switch from MyISAM to InnoDB. The following indexes assume InnoDB; they may not work as well in MyISAM.
users: INDEX(active, type)
userrating: INDEX(iduser, rating)
location: INDEX(idarea, area)
usertypes: INDEX(idtype, name)
useropen: INDEX(iduser)
scheduleday: INDEX(scheduleid, dayid, ddate)
scheduleuser: INDEX(iduser, scheduleid, dayid)
users: INDEX(iduser)
When adding a composite index, DROP index(es) with the same leading columns.
That is, when you have both INDEX(a) and INDEX(a,b), toss the former.
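For example, for the scheduleuser indexes above (a sketch; the existing key name idUser comes from the CREATE TABLE statements, the new index name is arbitrary):
ALTER TABLE scheduleuser
  DROP INDEX idUser,
  ADD INDEX idx_user_sched_day (iduser, scheduleid, dayid);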

How to optimize performance of a JOIN query on a large table

I am using Server version: 5.5.28-log MySQL Community Server (GPL).
I have a big table, called table A, consisting of 279,703,655 records. I have to join this table with one of my changelog tables, B, and then insert the matching records into a new tmp table C.
Table B has an index on column type.
Table A consists of prod_id, his_id, and other columns, and has indexes on both prod_id and history_id.
When I run the following query
INSERT INTO C(prod,his_id,comm)
SELECT DISTINCT a.product_id,a.history_id,comm
FROM B as b INNER JOIN A as a ON a.his_id = b.his_id AND b.type="applications"
GROUP BY prod_id
ON DUPLICATE KEY UPDATE
`his_id` = VALUES(`his_id`);
it takes 7 to 8 minutes to insert the records.
Even a simple count on table A takes 15 minutes.
I have also tried a procedure that inserts records in chunks (with LIMIT), but because the count query takes 15 minutes it is even slower than before.
BEGIN
  DECLARE n INT DEFAULT 0;
  DECLARE i INT DEFAULT 0;
  SELECT COUNT(*) FROM A INTO n;
  -- walk table A in chunks of 5,000,000 rows, starting at row 0
  WHILE i < n DO
    INSERT INTO C(product_id, history_id, comments)
    SELECT a.product_id, a.history_id, a.comments FROM B as b
    INNER JOIN (SELECT * FROM A LIMIT i, 5000000) as a ON a.history_id = b.history_id;
    SET i = i + 5000000;
  END WHILE;
END
But the above code also takes 15 to 20 minutes to execute.
Please suggest how I can make it faster.
Below is EXPLAIN result:
+----+-------------+-------+--------+---------------+---------+---------+-----------------+--------------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+-----------------+--------------+-------------+
| 1 | SIMPLE | a | ALL | (NULL) | (NULL) | (NULL) | (NULL) | 279703655 | |
| 1 | SIMPLE | b | eq_ref | PRIMARY | PRIMARY | 8 | DB.a.history_id | 1 | Using index |
+----+-------------+-------+--------+---------------+---------+---------+-----------------+--------------+-------------+
(from Comment)
CREATE TABLE B (
history_id bigint(20) unsigned NOT NULL AUTO_INCREMENT,
history_hash char(32) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
type enum('products','brands','partnames','mc_partnames','applications') NOT NULL,
stamp timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (history_id),
UNIQUE KEY history_hash (history_hash),
KEY type (type),
KEY stamp (stamp)
);
Let's first look at the tables.
What you call table B is really a history table. Its primary key is the history_id.
What you call table A is really a product table with one product per row and product_id its primary key. Each product also has a history_id. Thus you have created a 1:n relation. A product has one history row; one history row relates to multiple products.
You are selecting the product table rows that have an 'applications' type history entry. This should be written as:
select product_id, history_id, comm
from product
where history_id in
(
select history_id
from history
where type = 'applications'
);
(A join would work just as well, but isn't as clear. As there is only one history row per product, you can't get duplicates, so both GROUP BY and DISTINCT are completely superfluous in your query and should be removed in order not to give the DBMS unnecessary work to do. But as mentioned: better not to join at all. If you want rows from table A, select from table A. If you want to look up rows in table B, look them up in the WHERE clause, where all criteria belong.)
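For reference, the join form mentioned in the parenthesis would look like this (a sketch, using the same table names as the rewritten query above):
select p.product_id, p.history_id, p.comm
from product p
join history h on h.history_id = p.history_id
where h.type = 'applications';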
Now, we would have to know how many rows may be affected. If only 1% of all history rows are 'applications', then an index should be used. Preferably
create index idx1 on history (type, history_id);
… which finds rows by type and gets their history_id right away.
If, say, 20% of all history rows are 'applications', then reading the table sequentially might be more efficient.
Then, how many product rows may we get? Even with a single history row, we might get millions of related product rows. Or vice versa, with millions of history rows we might get no product row at all. Again, we can provide an index, which may or may not be used by the DBMS:
create index idx2 on product (history_id, product_id, comm);
This is about as fast as it gets: two indexes offered and a properly written query without an unnecessary join. There were times when MySQL had performance problems with IN; people rewrote the clause with EXISTS then. I don't think this is still necessary.
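If you do hit that on an older server, the EXISTS rewrite mentioned above would be (a sketch, semantically equivalent):
select product_id, history_id, comm
from product p
where exists
(
  select 1
  from history h
  where h.history_id = p.history_id
    and h.type = 'applications'
);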
As of MySQL 8.0.3, you can create histogram statistics for tables.
analyze history update histogram on type;
analyze product update histogram on history_id;
This is an important step to help the optimizer to find the optimal way to select the data.
Indexes needed (assuming it is history_id, not his_id):
B: INDEX(type, history_id) -- in this order. Note: "covering"
A: INDEX(history_id, product_id, comm)
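Expressed as DDL against the original table names (a sketch; the index names are arbitrary, column names as used above):
ALTER TABLE B ADD INDEX idx_type_history (type, history_id);
ALTER TABLE A ADD INDEX idx_history_product (history_id, product_id, comm);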
What column or combination of columns provides the uniqueness constraint that IODKU needs?
Really-- Provide SHOW CREATE TABLE.

Generating a histogram from mysql data

I was wondering if anyone had some advice for me regarding a histogram-generating query. I have a query that I like (in that it works), but it is extremely slow. Here is the background:
I have a table of metadata, a table of data values where one row in meta_data is a key-row for many (perhaps several thousand) rows in data_values, and a table of histogram bin information:
create table meta_data (
id int not null primary key,
name varchar(100),
other_data char(10)
);
create table data_values (
id int not null primary key,
meta_data_id int not null,
data_value real
);
create table histogram_bins (
id int not null primary key,
bin_min real,
bin_max real,
bin_center real,
bin_size real
);
And a query that creates the histogram:
SELECT md.name AS `Name`,
md.other_data AS `OtherData`,
hist.bin_center AS `Bin`,
SUM(data.data_value BETWEEN hist.bin_min AND hist.bin_max) AS `Frequency`
FROM histogram_bins hist
LEFT JOIN data_values data ON 1 = 1
LEFT JOIN meta_data md ON md.id = data.meta_data_id
GROUP BY md.id, `Bin`;
In an earlier version of this query, the BETWEEN ... AND condition was down in the JOIN (replacing 1 = 1), but then I would only receive histogram rows with non-zero frequency. I need rows for all of the bins (even the zero-frequency ones) for analysis purposes.
It's pretty darn slow, to the tune of 10-15 minutes or so. The data_values table has about 7.9 million rows, and meta_data weighs in at 15,900 rows, so maybe it is just going to take a long time!
Thanks very much!
I think this might help:
SELECT h.bin_center AS `Bin`,
       IFNULL(f.`Frequency`, 0) AS `Frequency`
FROM histogram_bins h
LEFT JOIN
  (SELECT hist.bin_center,
          COUNT(data.id) AS `Frequency`
   FROM data_values data
   LEFT JOIN histogram_bins hist ON data.data_value BETWEEN hist.bin_min AND hist.bin_max
   GROUP BY hist.bin_center) f ON f.bin_center = h.bin_center;
I changed the order of the tables because I think it's best to find the corresponding bin for every record in the data and then just count how many there are, grouped by bin.
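If you still need the per-meta_data breakdown from the original query, the same idea extends to it (a sketch, untested; the CROSS JOIN pairs every metadata row with every bin so the zero-frequency bins survive):
SELECT md.name AS `Name`,
       md.other_data AS `OtherData`,
       h.bin_center AS `Bin`,
       IFNULL(f.`Frequency`, 0) AS `Frequency`
FROM meta_data md
CROSS JOIN histogram_bins h
LEFT JOIN
  (SELECT data.meta_data_id, hist.bin_center,
          COUNT(*) AS `Frequency`
   FROM data_values data
   JOIN histogram_bins hist ON data.data_value BETWEEN hist.bin_min AND hist.bin_max
   GROUP BY data.meta_data_id, hist.bin_center) f
  ON f.meta_data_id = md.id AND f.bin_center = h.bin_center;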

mysql left join, limit and sorting

I have a question. I need to make a left join between two tables and get only the first result (I mean the first record in table A that doesn't match anything in table B).
This is an example
create table a (
id int not null auto_increment primary key,
name varchar(50),
surname varchar(50),
prov char(2)
) engine = myisam;
insert into a (name,surname,prov)
values ('aaa','aaa','ss'),('bbb','bbb','ca'),('ccc','ccc','mi'),('ddd','ddd','mi'),('eee','eee','to'),
('fff','fff','mi'),('ggg','ggg','ss'),('hhh','hhh','mi'),('jjj','jjj','ss'),('kkk','kkk','to');
create table b (
id int not null auto_increment primary key,
id_name int
) engine = myisam;
insert into b (id_name) values (3),(4),(8),(5),(10),(1);
Query A:
select a.*
from a
left join b
on a.id = b.id_name
where b.id_name is null and a.prov = 'ss'
order by a.id
limit 1
Query B:
select a.*
from a
left join b
on a.id = b.id_name
where b.id_name is null and a.prov = 'ss'
limit 1
Both queries give me the right result, that is, the record with id = 7.
I want to know if I can rely on query B even without specifying a sort on id, or whether it's just by chance that I get the right result.
I ask because on a large recordset (more than 10 million rows), the query without sorting returns a record immediately, while with sorting it takes more than 20 seconds, even though a.id is the primary key.
Thanks in advance.
You can't rely on query B. MySQL just returned whatever it found fastest to return.
Is there an index on table "b", column "id_name"? If not, create it and tell us what you get (I mean how fast). It doesn't matter that you are looking for non-matched rows: the JOIN has to be performed before MySQL can test whether there is a match or not.
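A sketch of that index (the name is arbitrary):
ALTER TABLE b ADD INDEX idx_id_name (id_name);
Then re-run query A and compare the EXPLAIN output.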

mysql optimization - display 10 most recent records, but also identify duplicate rows

I am new to mysql and I have been pulling my hair out over this problem for days. I need to improve/optimize this query so that it runs faster; right now it's taking over 5 seconds.
Here is the query:
SELECT SQL_NO_CACHE COUNT(*) as multiple, a.*, s.*
FROM announcements as a
INNER JOIN stores as s
ON a.username=s.username
WHERE s.username is not null AND s.state='NC'
GROUP BY a.announcement_id
ORDER BY a.dt DESC LIMIT 0,10
Stores table consists of: store_id, username, name, state, city, zip, etc...
Announcements table consists of: announcement_id, msg, dt, username
The stores table has around 10,000 records and the announcements table has around 500,000 records.
What I'm trying to accomplish in English: display the 10 most recent store announcements, BUT what makes this complicated is that stores can have multiple entries in the stores table with the same username (one row per location). So if a chain store, let's say "Chipotle", sends an announcement, I want to display only one row for their announcement with a note next to it that says "this store has multiple locations". That's why I'm using the count(*) and group by: if count(*) > 1 I know there are multiple locations for the announcement.
The where condition can be any state, city, or zip. I'm using SQL_NO_CACHE because announcements are updated frequently, so you rarely get the same results; does that make sense?
I would really appreciate any suggestions on how to do this better. I know little about indexes, but I did create an index on the "username" field in both tables. Feel free to shred me apart here, I know I must be missing something.
Update --
DESC stores;
Field Type Null Key Default Extra
store_id int(11) NO PRI NULL auto_increment
username varchar(20) NO MUL NULL
name varchar(100) NO NULL
street varchar(100) NO NULL
city varchar(50) NO NULL
state varchar(2) NO NULL
zip varchar(15) NO NULL
DESC announcements;
Field Type Null Key Default Extra
dt datetime NO NULL
username varchar(20) NO MUL NULL
msg varchar(200) NO NULL
announcement_id int(11) NO PRI NULL auto_increment
EXPLAIN output;
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE a index username PRIMARY 47 NULL 315001 Using temporary; Using filesort
1 SIMPLE b ref username username 62 a.username 1 Using where
Try something like this:
SELECT SQL_NO_CACHE s.multiple, a.*
FROM announcements as a
INNER JOIN
(
SELECT username, COUNT(username) as multiple FROM stores
WHERE username IS NOT NULL AND state = 'NC'
GROUP BY username
) as s
ON a.username=s.username
ORDER BY a.dt DESC LIMIT 10
If you are ordering on the dt column but there is no index on that column, then MySQL has to do a (slow, expensive) sort of all of your result rows on that column every time you run the query.
Try adding an index on announcements.dt; MySQL may be able to access the rows in order by using the index and avoid the sorting step afterwards.
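For instance (a sketch; the index name is arbitrary):
ALTER TABLE announcements ADD INDEX idx_dt (dt);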
Change the order of tables in your JOIN: MySQL reads rows from the first table and then finds matching rows in the second table. If you always filter your results by fields in the stores table, then stores should be the leading table in your JOIN, so MySQL won't pick and sort unnecessary rows from the announcements table.
In the EXPLAIN output you pasted, it seems like only one shop matched the query; switching the order of the tables would cause MySQL to only look for that specific shop in the announcements table.
Add an index on the dt column (having an indexed integer column with a unixtime value would be even better).
If possible, create an integer userID for each username and JOIN using that column (add an index on it as well).
Not sure if MySQL still has problems with this but replacing COUNT(*) with COUNT(1) might be helpful.