I am trying to figure out the most efficient method of writing the query below. It runs against a users table of 3k records, scheduleday of 12k records, and scheduleuser of 300k records.
The method I am using works, but it is not fast. It is plenty fast for 100 records and under, but that is not how I need it displayed. I know there must be a more efficient way of running this; if I take out the nested select, it runs in 0.00025 seconds. Add the nested select, and we're pushing 9+ seconds.
All I am trying to do is get the most recent date a user was scheduled. The scheduleuser table only stores the scheduleid and dayid. These are then looked up in scheduleday to get the date. I can't use max(scheduleuser.rec) because the order entered may not be in date order.
The result of this query would be:
Bob 4/6/2022
Ralph 4/7/2022
Please note this query works perfectly fine, I am looking for ways to make it more efficient.
Percona Server Mysql 5.5
SELECT
    (SELECT MAX(STR_TO_DATE(scheduleday.ddate, '%m/%d/%Y')) FROM scheduleuser su1
     LEFT JOIN scheduleday ON scheduleday.scheduleid=su1.scheduleid AND scheduleday.dayid=su1.dayid
     WHERE su1.idUser=users.idUser
    ) as lastscheduledate, users.usersName
FROM users
users
idUser | usersName
-------+----------
1      | bob
2      | ralph
scheduleday
scheduleid | dayid | ddate
-----------+-------+---------
1          | 1     | 4/5/2022
1          | 2     | 4/6/2022
1          | 3     | 4/7/2022
scheduleuser (su1)
rec | idUser | dayid | scheduleid
----+--------+-------+-----------
1   | 1      | 2     | 1
1   | 2      | 3     | 1
1   | 1      | 1     | 1
As requested, the full query:
SELECT users.iduser, users.adminName, users.firstname, users.lastname, users.lastLogin, users.area, users.type, users.terminationdate, users.termreason, users.cellphone,
    (SELECT MAX(STR_TO_DATE(scheduleday.ddate, '%m/%d/%Y')) FROM scheduleuser
     LEFT JOIN scheduleday ON scheduleday.scheduleid=scheduleuser.scheduleid AND scheduleday.dayid=scheduleuser.dayid
     WHERE scheduleuser.iduser=users.iduser
    ) as lastscheduledate,
    IFNULL(userrating.rating,'0.00') as userrating, IFNULL(location.area,'') as userarea, IFNULL(usertypes.name,'') as usertype, IFNULL(useropen.iduser,0) as useropen
FROM users
LEFT JOIN userrating ON userrating.iduser=users.iduser
LEFT JOIN location ON location.idarea=users.area
LEFT JOIN usertypes ON usertypes.idtype=users.type
LEFT JOIN useropen ON useropen.iduser=users.iduser
WHERE
users.type<>0 AND users.active=1
ORDER BY users.firstName
As requested, the CREATE TABLE statements:
CREATE TABLE `users` (
`idUser` int(11) NOT NULL,
`usersName` varchar(255) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
ALTER TABLE `users`
ADD PRIMARY KEY (`idUser`);
ALTER TABLE `users`
MODIFY `idUser` int(11) NOT NULL AUTO_INCREMENT;
COMMIT;
CREATE TABLE `scheduleday` (
`rec` int(11) NOT NULL,
`scheduleid` int(11) NOT NULL,
`dayid` int(11) NOT NULL,
`ddate` varchar(255) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
ALTER TABLE `scheduleday`
ADD PRIMARY KEY (`rec`),
ADD KEY `dayid` (`dayid`),
ADD KEY `scheduleid` (`scheduleid`);
ALTER TABLE `scheduleday`
MODIFY `rec` int(11) NOT NULL AUTO_INCREMENT;
COMMIT;
CREATE TABLE `scheduleuser` (
`rec` int(11) NOT NULL,
`idUser` int(11) NOT NULL,
`dayid` int(11) NOT NULL,
`scheduleid` int(11) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
ALTER TABLE `scheduleuser`
ADD PRIMARY KEY (`rec`),
ADD KEY `idUser` (`idUser`),
ADD KEY `dayid` (`dayid`),
ADD KEY `scheduleid` (`scheduleid`);
ALTER TABLE `scheduleuser`
MODIFY `rec` int(11) NOT NULL AUTO_INCREMENT;
COMMIT;
I think my recommendation would be to do that subquery once with a GROUP BY and join it. Something like
SELECT users.iduser, users.adminName, users.firstname, users.lastname, users.lastLogin, users.area, users.type, users.terminationdate, users.termreason, users.cellphone,
lsd.lastscheduledate,
IFNULL(userrating.rating,'0.00') as userrating, IFNULL(location.area,'') as userarea, IFNULL(usertypes.name,'') as usertype, IFNULL(useropen.iduser,0) as useropen
FROM users
LEFT JOIN (SELECT iduser, MAX(STR_TO_DATE(scheduleday.ddate, '%m/%d/%Y')) lastscheduledate FROM scheduleuser LEFT JOIN scheduleday ON scheduleday.scheduleid=scheduleuser.scheduleid AND scheduleday.dayid=scheduleuser.dayid
GROUP BY iduser
) lsd
ON lsd.iduser=users.iduser
LEFT JOIN userrating ON userrating.iduser=users.iduser
LEFT JOIN location ON location.idarea=users.area
LEFT JOIN usertypes ON usertypes.idtype=users.type
LEFT JOIN useropen ON useropen.iduser=users.iduser
WHERE
users.type<>0 AND users.active=1
ORDER BY users.firstName
This will likely be more efficient since the DB can do the query once for all users, likely using your scheduleuser.iduser index.
If you are using something like above and it's still not performant, I might suggest experimenting with:
ALTER TABLE scheduleuser ADD INDEX (scheduleid, dayid)
ALTER TABLE scheduleday ADD INDEX (scheduleid, dayid)
This would ensure it can do the entire join in the subquery with the indexes. Of course, there are tradeoffs to adding more indexes, so depending on your data profile it might not be worth it (and it might not actually improve anything).
If you are using your original query, I might suggest experimenting with:
ALTER TABLE scheduleuser ADD INDEX (iduser,scheduleid, dayid)
ALTER TABLE scheduleday ADD INDEX (scheduleid, dayid)
This would allow it to do the subquery (both the JOIN and the WHERE) without touching the actual scheduleuser table at all. Again, I say "experiment" since there are tradeoffs and this might not actually improve things much.
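To check whether the subquery is actually covered, you can run EXPLAIN on it for a single user (the literal idUser = 1 is just an example value) and look for your new index under "key" and for "Using index" under "Extra":

EXPLAIN SELECT MAX(STR_TO_DATE(sd.ddate, '%m/%d/%Y'))
FROM scheduleuser su1
LEFT JOIN scheduleday sd ON sd.scheduleid=su1.scheduleid AND sd.dayid=su1.dayid
WHERE su1.idUser = 1;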
When you nest a query in the SELECT as you're doing, that query gets evaluated for each record in the result set, because its WHERE clause uses a column from outside the query. You really just want to calculate the result set of max dates once and join your users onto it after it is done:
select usersName, last_scheduled
from users
left join (select su.iduser, max(STR_TO_DATE(sd.ddate, '%m/%d/%Y')) as last_scheduled
from scheduleuser as su left join scheduleday as sd on su.dayid = sd.dayid
and su.scheduleid = sd.scheduleid
group by su.iduser) recents on users.iduser = recents.iduser
I've obviously left your other columns off and just given you the name and date, but this is the general principle.
Watch out for a bug lurking here:
STR_TO_DATE(MAX(scheduleday.ddate), '%m/%d/%Y')
Since ddate is a VARCHAR in '%m/%d/%Y' format, MAX() over the raw strings compares them lexically, so do not be tempted to move the MAX outside the conversion. Keep it as
MAX(STR_TO_DATE(scheduleday.ddate, '%m/%d/%Y'))
Else you will be in for a rude surprise next January ('12/31/2022' sorts before '4/6/2022' as a string). Better yet, store ddate as a real DATE column.
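You can see the difference with a self-contained comparison (the two literal dates are made up for illustration):

SELECT MAX(d) AS lexical_max,                       -- '4/6/2022'   (wrong: string comparison)
       MAX(STR_TO_DATE(d, '%m/%d/%Y')) AS date_max  -- '2022-12-31' (correct)
FROM (SELECT '4/6/2022' AS d UNION ALL SELECT '12/31/2022') t;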
Possible better indexes. Switch from MyISAM to InnoDB. The following indexes assume InnoDB; they may not work as well in MyISAM.
users: INDEX(active, type)
userrating: INDEX(iduser, rating)
location: INDEX(idarea, area)
usertypes: INDEX(idtype, name)
useropen: INDEX(iduser)
scheduleday: INDEX(scheduleid, dayid, ddate)
scheduleuser: INDEX(iduser, scheduleid, dayid)
users: INDEX(iduser)
When adding a composite index, DROP index(es) with the same leading columns.
That is, when you have both INDEX(a) and INDEX(a,b), toss the former.
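For example, given the index names from the CREATE TABLEs above, the single-column idUser index becomes redundant once the composite index exists:

ALTER TABLE scheduleuser
    DROP INDEX idUser,
    ADD INDEX (iduser, scheduleid, dayid);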
MySQL 5.7/8.0
Table DDL:
-- auto-generated definition
create table test_date_index
(
id int auto_increment primary key,
account_id int not null,
remark varchar(10) null,
cal_date date null,
constraint cal_date_index
unique (cal_date, account_id)
);
In this case the index is not used:
explain
select *
from test_date_index
where (account_id, cal_date) in (
select account_id, max(cal_date) from test_date_index group by account_id
);
But it works in this case:
explain
select *
from test_date_index
where (account_id, cal_date) in (
select account_id, '2022-04-18' from test_date_index group by account_id
)
I think this is because of the type of the cal_date column, but I can't find any documentation about this.
What version are you using? Before 5.7, "row constructors" were not optimized. However, the lack of the optimal index may be the main cause of sluggishness.
For the first query...
Rewrite the "groupwise-max" query thus:
select b.*
FROM ( SELECT account_id, max(cal_date) AS cal_date
from test_date_index
group by account_id ) AS a
JOIN test_date_index AS b USING(account_id, cal_date)
Promote the UNIQUE index to this:
PRIMARY KEY(account_id, cal_date)
with those columns in that order. Specifically, account_id needs to be first in order to be useful in the "derived" query (subquery) that I used. Also, it tends to be a better way to organize the table.
Your second query shows that it can use your backward index and that 'row constructors' are optimized in the version you are running.
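With PRIMARY KEY(account_id, cal_date) in place, you can confirm that the derived query is satisfied from the index alone; if the optimizer picks a loose index scan, EXPLAIN will show "Using index for group-by" in Extra:

EXPLAIN
SELECT account_id, max(cal_date) AS cal_date
FROM test_date_index
GROUP BY account_id;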
I've got a very simple query that has to group the results by a field from the joined table:
SELECT SQL_NO_CACHE p.name, COUNT(1) FROM ycs_sales s
INNER JOIN ycs_products p ON s.id = p.sales_id
WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59'
GROUP BY p.name
Table ycs_products is actually sales_products; it lists the products in each sale. I want to see the share of each product sold over a period of time.
The current query speed is 2 seconds, which is too much for user interaction. I need to make this query run fast. Is there a way to get rid of Using temporary without denormalization?
The join order is critically important; there is a lot of data in both tables, and limiting the number of records by date is an unquestionable prerequisite.
Here is the EXPLAIN result:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: s
type: range
possible_keys: PRIMARY,dtm
key: dtm
key_len: 6
ref: NULL
rows: 1164728
Extra: Using where; Using index; Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: p
type: ref
possible_keys: sales_id
key: sales_id
key_len: 5
ref: test.s.id
rows: 1
Extra:
2 rows in set (0.00 sec)
And the same in JSON:
EXPLAIN: {
"query_block": {
"select_id": 1,
"filesort": {
"sort_key": "p.`name`",
"temporary_table": {
"table": {
"table_name": "s",
"access_type": "range",
"possible_keys": ["PRIMARY", "dtm"],
"key": "dtm",
"key_length": "6",
"used_key_parts": ["dtm"],
"rows": 1164728,
"filtered": 100,
"attached_condition": "s.dtm between '2018-02-16 00:00:00' and '2018-02-22 23:59:59'",
"using_index": true
},
"table": {
"table_name": "p",
"access_type": "ref",
"possible_keys": ["sales_id"],
"key": "sales_id",
"key_length": "5",
"used_key_parts": ["sales_id"],
"ref": ["test.s.id"],
"rows": 1,
"filtered": 100
}
}
}
}
}
As well as the CREATE TABLE statements, though I find them unnecessary:
CREATE TABLE `ycs_sales` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`dtm` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `dtm` (`dtm`)
) ENGINE=InnoDB AUTO_INCREMENT=2332802 DEFAULT CHARSET=latin1
CREATE TABLE `ycs_products` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`sales_id` int(11) DEFAULT NULL,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `sales_id` (`sales_id`)
) ENGINE=InnoDB AUTO_INCREMENT=2332802 DEFAULT CHARSET=latin1
And also PHP code to replicate the test environment:
<?php
// Connection setup (the DSN and credentials are placeholders; adjust for your environment)
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'password');
#$pdo->query("set global innodb_flush_log_at_trx_commit = 2");
$pdo->query("create table ycs_sales (id int auto_increment primary key, dtm datetime)");
$stmt = $pdo->prepare("insert into ycs_sales values (null, ?)");
foreach (range(mktime(0,0,0,2,1,2018), mktime(0,0,0,2,28,2018)) as $stamp){
$stmt->execute([date("Y-m-d", $stamp)]);
}
$max_id = $pdo->lastInsertId();
$pdo->query("alter table ycs_sales add key(dtm)");
$pdo->query("create table ycs_products (id int auto_increment primary key, sales_id int, name varchar(255))");
$stmt = $pdo->prepare("insert into ycs_products values (null, ?, ?)");
$products = ['food', 'drink', 'vape'];
foreach (range(1, $max_id) as $id){
$stmt->execute([$id, $products[rand(0,2)]]);
}
$pdo->query("alter table ycs_products add key(sales_id)");
The problem is that grouping by name makes you lose the sales_id information, therefore MySQL is forced to use a temporary table.
Although it's not the cleanest of solutions, and one of my least favorite approaches, you could add a new index on both the name and the sales_id columns, like:
ALTER TABLE `yourdb`.`ycs_products`
ADD INDEX `name_sales_id_idx` (`name` ASC, `sales_id` ASC);
and force the query to use this index, with either force index or use index:
SELECT SQL_NO_CACHE p.name, COUNT(1) FROM ycs_sales s
INNER JOIN ycs_products p use index(name_sales_id_idx) ON s.id = p.sales_id
WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59'
GROUP BY p.name;
My execution reported only "using where; using index" on the table p and "using where" on the table s.
Anyway, I strongly suggest you rethink your schema, because you might find a better design for these two tables. On the other hand, if this is not a critical part of your application, you can deal with the "forced" index.
EDIT
Since it's quite clear that the problem is in the design, I suggest remodeling the relationship as a many-to-many. If you have a chance to verify it in your testing environment, here's what I would do:
1) Create a temporary table just to store name and id of the product:
create temporary table tmp_prods
select min(id) id, name
from ycs_products
group by name;
2) Starting from the temporary table, create a replacement for ycs_products:
create table ycs_products_new
select * from tmp_prods;
ALTER TABLE `poc`.`ycs_products_new`
CHANGE COLUMN `id` `id` INT(11) NOT NULL ,
ADD PRIMARY KEY (`id`);
3) Create the join table:
CREATE TABLE `prod_sale` (
`prod_id` INT(11) NOT NULL,
`sale_id` INT(11) NOT NULL,
PRIMARY KEY (`prod_id`, `sale_id`),
INDEX `sale_fk_idx` (`sale_id` ASC),
CONSTRAINT `prod_fk`
FOREIGN KEY (`prod_id`)
REFERENCES ycs_products_new (`id`)
ON DELETE NO ACTION
ON UPDATE NO ACTION,
CONSTRAINT `sale_fk`
FOREIGN KEY (`sale_id`)
REFERENCES ycs_sales (`id`)
ON DELETE NO ACTION
ON UPDATE NO ACTION);
and fill it with the existing values:
insert into prod_sale (prod_id, sale_id)
select tmp_prods.id, sales_id from ycs_sales s
inner join ycs_products p
on p.sales_id=s.id
inner join tmp_prods on tmp_prods.name=p.name;
Finally, the join query:
select name, count(name) from ycs_products_new p
inner join prod_sale ps on ps.prod_id=p.id
inner join ycs_sales s on s.id=ps.sale_id
WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59'
group by p.id;
Please note that the GROUP BY is on the primary key, not on name.
Explain output:
explain select name, count(name) from ycs_products_new p inner join prod_sale ps on ps.prod_id=p.id inner join ycs_sales s on s.id=ps.sale_id WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59' group by p.id;
+------+-------------+-------+--------+---------------------+---------+---------+-----------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+--------+---------------------+---------+---------+-----------------+------+-------------+
| 1 | SIMPLE | p | index | PRIMARY | PRIMARY | 4 | NULL | 3 | |
| 1 | SIMPLE | ps | ref | PRIMARY,sale_fk_idx | PRIMARY | 4 | test.p.id | 1 | Using index |
| 1 | SIMPLE | s | eq_ref | PRIMARY,dtm | PRIMARY | 4 | test.ps.sale_id | 1 | Using where |
+------+-------------+-------+--------+---------------------+---------+---------+-----------------+------+-------------+
Why have an id for ycs_products? It seems like the sales_id should be the PRIMARY KEY of that table?
If that is possible, it eliminates the performance problem by getting rid of the issues brought up by senape.
If, instead, there are multiple rows for each sales_id, then changing the secondary index to this would help:
INDEX(sales_id, name)
Another thing to check on is innodb_buffer_pool_size. It should be about 70% of available RAM. This would improve the cacheablity of data and indexes.
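A quick way to compare the current pool size against the data it would need to cache (the 70% figure is a rule of thumb for a dedicated database server, not a hard rule):

SELECT @@innodb_buffer_pool_size / 1024 / 1024 AS pool_mb;
SELECT ROUND(SUM(data_length + index_length) / 1024 / 1024) AS tables_mb
    FROM information_schema.TABLES
    WHERE table_name IN ('ycs_sales', 'ycs_products');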
Are there really 1.1 million rows in that one week?
Summary table.
Build and maintain a table that summarizes all sales on a daily basis. It would have the name (denormalized) and date. Hence the table should be smaller than the original data.
The summary table would be something like
CREATE TABLE sales_summary (
dy DATE NOT NULL,
name varchar(255) NOT NULL,
daily_count SMALLINT UNSIGNED NOT NULL,
PRIMARY KEY(dy, name),
INDEX(name, dy) -- (You might need this for other queries)
) ENGINE=InnoDB;
The nightly (after midnight) update would be a single query something like the following. It may well take more than 2 seconds, but no user is waiting for it.
INSERT INTO sales_summary (dy, name, daily_count)
    SELECT DATE(s.dtm) AS dy,
           p.name,
           COUNT(*) AS daily_count
        FROM ycs_sales s
        JOIN ycs_products p ON s.id = p.sales_id
        WHERE s.dtm >= CURDATE() - INTERVAL 1 DAY
          AND s.dtm  < CURDATE()
        GROUP BY 1, 2
    ON DUPLICATE KEY UPDATE
        daily_count = daily_count + VALUES(daily_count);
And the user's query will be something like:
SELECT SQL_NO_CACHE
       name,
       SUM(daily_count)
    FROM sales_summary
    WHERE dy >= '2018-02-16'
      AND dy  < '2018-02-16' + INTERVAL 7 DAY
    GROUP BY name;
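If you don't have a cron job available, the nightly rollup can be scheduled inside MySQL itself with the event scheduler. A sketch (the event name is arbitrary, and event_scheduler=ON must be set):

CREATE EVENT ev_sales_summary_rollup
ON SCHEDULE EVERY 1 DAY
STARTS CURRENT_DATE + INTERVAL 1 DAY + INTERVAL 15 MINUTE
DO
    INSERT INTO sales_summary (dy, name, daily_count)
        SELECT DATE(s.dtm), p.name, COUNT(*)
            FROM ycs_sales s
            JOIN ycs_products p ON s.id = p.sales_id
            WHERE s.dtm >= CURDATE() - INTERVAL 1 DAY
              AND s.dtm < CURDATE()
            GROUP BY 1, 2
    ON DUPLICATE KEY UPDATE daily_count = daily_count + VALUES(daily_count);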
More discussion of Summary Tables: http://mysql.rjweb.org/doc.php/summarytables
Referring to your comment below, I assume filtering by the column s.dtm is inevitable.
The join order is critically important, there is a lot of data in both tables and limiting the number of records by date is unquestionable prerequisite.
The most crucial action you can take is to observe the frequent search patterns.
For example, if your search criteria for dtm usually retrieve whole days' data, i.e. a few days (say fewer than 15) between 00:00:00 and 23:59:59, you can use this information to offload the search-time overhead to insert time.
One method: add a new column to your table which holds the truncated day, and index that new column. (In MySQL there is no concept of a functional index as there is in Oracle; that is why we need to add a new column to imitate that functionality.) Something like:
alter table ycs_sales add dtm_truncated date;
delimiter //
create trigger dtm_truncater_insert
before insert on ycs_sales
for each row
set new.dtm_truncated = date(new.dtm);
//
delimiter //
create trigger dtm_truncater_update
before update on ycs_sales
for each row
set new.dtm_truncated = date(new.dtm);
//
delimiter ;
create index index_ycs_sales_dtm_truncated on ycs_sales(dtm_truncated) using hash;
# backfill the new column for existing rows; bypass safe update mode with id > -1
update ycs_sales set dtm_truncated = date(dtm) where id > -1;
Then you can query the dtm_truncated field with IN. Of course this has its own tradeoffs; longer ranges will not work. But as I mentioned above, what you can do is use the new column as a precomputed function output that indexes the likely searches at insert/update time.
SELECT SQL_NO_CACHE p.name, COUNT(1) FROM ycs_sales s
INNER JOIN ycs_products p ON s.id = p.sales_id
WHERE s.dtm_truncated in ( '2018-02-16', '2018-02-17', '2018-02-18', '2018-02-19', '2018-02-20', '2018-02-21', '2018-02-22')
GROUP BY p.name
Additionally, make sure your key on dtm is a BTREE key. (If it is a hash key, then InnoDB needs to go through all keys for a range scan.) The syntax to generate a BTREE index is:
create index index_ycs_sales_dtm on ycs_sales(dtm) using btree;
One final note:
"Partition pruning" is a way to skip the date range check at select time by partitioning the data at insert time. In MySQL, though, the partitioning column must be part of every unique key, including the primary key. I believe you don't want to add the dtm column to the primary key, but if you can do so, then you can also partition your data and get rid of the date range check overhead at select time.
Not really providing an answer here, but I believe the core of the issue is nailing down where the real slowdown is happening.
I'm not a MySQL expert, but I would try running the following queries:
SELECT SQL_NO_CACHE name, count(*) FROM (
    SELECT p.name FROM ycs_sales s INNER JOIN ycs_products p ON s.id = p.sales_id
    WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59') t
GROUP BY name
SELECT SQL_NO_CACHE COUNT(*) FROM (
    SELECT name, count(*) AS cnt FROM (
        SELECT p.name FROM ycs_sales s INNER JOIN ycs_products p ON s.id = p.sales_id
        WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59') t1
    GROUP BY name
) t2
SELECT SQL_NO_CACHE s.* FROM ycs_sales s
WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59'
SELECT SQL_NO_CACHE COUNT(*) FROM ycs_sales s
WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59'
When you do, can you tell us how long each one took?
I've run some test queries on the same data set, and here are my results:
Your query executes in 1.4 seconds.
After adding the covering index on ycs_products(sales_id, name) with
ALTER TABLE `ycs_products`
DROP INDEX `sales_id`,
ADD INDEX `sales_id_name` (`sales_id`, `name`)
the execution time drops to 1.0 second.
I still see "Using temporary; Using filesort" in the EXPLAIN result.
But now there is also "Using index" - Which means, no lookup to the clustered index is needed to get the values of the name column.
Note: I dropped the old index, since it would be redundant for most queries.
But you might have some queries which need that index with id (the PK) coming right after sales_id.
You explicitly asked, how to get rid of "Using temporary".
But even if you find a way to force an execution plan, which will avoid the filesort, you wouldn't win much.
Consider the following query:
SELECT SQL_NO_CACHE COUNT(1) FROM ycs_sales s
INNER JOIN ycs_products p ON s.id = p.sales_id
WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59'
This one needs 0.855 seconds.
Since there is no GROUP BY clause, no filesort is performed.
It doesn't return the result that you want -
But the point is: this is the bottom limit of what you can get without storing and maintaining redundant data.
If you want to know where the engine spends most of the time - remove the JOIN:
SELECT SQL_NO_CACHE COUNT(1) FROM ycs_sales s
WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59'
It executes in 0.155 seconds. So we can conclude: The JOIN is the most expensive part of the query. And you cannot avoid it.
The full list of execution times:
0.155 sec (11%) to read and count 604K rows
0.690 sec (49%) for the JOIN (which you cannot avoid)
0.385 sec (28%) for second lookup (which can be removed with an index)
0.170 sec (12%) for GROUP BY with filesort (which you try to avoid)
So again: "Using temporary; Using filesort" looks bad in the EXPLAIN result - But it's not your biggest problem.
Test environment:
Windows 10 + MariaDB 10.3.13 with innodb_buffer_pool_size = 1G
Test data has been generated with the following script (takes about 1 to 2 minutes on an HDD):
drop table if exists ids;
create table ids(id mediumint unsigned auto_increment primary key);
insert into ids(id)
select null as id
from information_schema.COLUMNS c1
, information_schema.COLUMNS c2
, information_schema.COLUMNS c3
limit 2332801; -- 60*60*24*27 + 1
drop table if exists ycs_sales;
CREATE TABLE `ycs_sales` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`dtm` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `dtm` (`dtm`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
insert into ycs_sales(id, dtm) select id, date('2018-02-01' + interval (id-1) second) from ids;
drop table if exists ycs_products;
CREATE TABLE `ycs_products` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`sales_id` int(11) DEFAULT NULL,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `sales_id` (`sales_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
insert into ycs_products(id, sales_id, name)
select id
, id as sales_id
, case floor(rand(1)*3)
when 0 then 'food'
when 1 then 'drink'
when 2 then 'vape'
end as name
from ids;
I have had similar problems several times. Usually, I'd expect the best results to be obtained with
CREATE INDEX s_date ON ycs_sales(dtm, id)
-- Add a covering index
CREATE INDEX p_name ON ycs_products(sales_id, name);
This ought to get rid of the "the tables are very large" problem, since all information required is now contained in the two indexes. Actually I seem to remember that the first index does not need id if the latter is the primary key.
If this is still not enough, because the two tables are too large, then you have no choice - you must avoid the JOIN. It is already going as fast as it can and if that's not enough, then it has to go.
I believe you can do this with a couple of TRIGGERs to maintain an ancillary daily sales report table (if you never have returned products, then just one trigger on INSERT in sales will suffice). Try to go with just (product_id, sales_date, sales_count) and JOIN that with the product table to get the name upon output; if that is not enough, then use (product_id, product_name, sales_date, sales_count) and periodically update product_name to keep names synced by reading them off the primary table. Since you run your searches on sales_date, you can put it in the primary key and partition the ancillary table based on the sales year.
(Once or twice, when partitioning was not possible but I was confident that I would only very rarely cross the "ideal" partition boundary, I partitioned manually - i.e. sales_2012, sales_2013, sales_2014 - and built programmatically a UNION of the two or three years involved, followed by a regroup, resort and secondary totalization stage. Crazy as a March hare, yes, but it worked).
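Here is a minimal sketch of the trigger idea against the original two tables (the daily_sales table and the trigger name are hypothetical; it keys by product name, since the original schema has no separate product table):

CREATE TABLE daily_sales (
    sales_date  DATE         NOT NULL,
    name        VARCHAR(255) NOT NULL,
    sales_count INT UNSIGNED NOT NULL,
    PRIMARY KEY (sales_date, name)
) ENGINE=InnoDB;

-- One counter bump per product line inserted; the sale's date is looked up from ycs_sales.
CREATE TRIGGER ycs_products_ai AFTER INSERT ON ycs_products
FOR EACH ROW
    INSERT INTO daily_sales (sales_date, name, sales_count)
        SELECT DATE(s.dtm), NEW.name, 1
        FROM ycs_sales s
        WHERE s.id = NEW.sales_id
    ON DUPLICATE KEY UPDATE sales_count = sales_count + 1;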
I am developing an application for my college's website and I would like to pull all the events in ascending date order from the database. There is a total of four tables:
Table Events1
event_id, mediumint(8), Unsigned
date, date,
Index -> Primary Key (event_id)
Index -> (date)
Table events_users
event_id, smallint(5), Unsigned
user_id, mediumint(8), Unsigned
Index -> PRIMARY (event_id, user_id)
Table user_bm
link, varchar(26)
user_id, mediumint(8)
Index -> PRIMARY (link, user_id)
Table user_eoc
link, varchar(8)
user_id, mediumint(8)
Index -> Primary (link, user_id)
Query:
EXPLAIN SELECT * FROM events1 E INNER JOIN events_users EU ON E.event_id = EU.event_id
RIGHT JOIN user_eoc EOC ON EU.user_id = EOC.user_id
INNER JOIN user_bm BM ON EOC.user_id = BM.user_id
WHERE E.date >= '2013-01-01' AND E.date <= '2013-01-31'
AND EOC.link = "E690"
AND BM.link like "1.1%"
ORDER BY E.date
EXPLANATION:
The query above does two things.
1) It searches and filters out all students through the user_bm and user_eoc tables. The "link" columns are denormalized columns used to quickly filter students by major/year/campus etc.
2) After applying the filter, MySQL grabs the user_ids of all matching students, finds all events they are attending, and outputs them in ascending order.
QUERY OPTIMIZER EXPLAIN:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE EOC ref PRIMARY PRIMARY 26 const 47 Using where; Using index; Using temporary; Using f...
1 SIMPLE BM ref PRIMARY,user_id-link user_id-link 3 test.EOC.user_id 1 Using where; Using index
1 SIMPLE EU ref PRIMARY,user_id user_id 3 test.EOC.user_id 1 Using index
1 SIMPLE E eq_ref PRIMARY,date-event_id PRIMARY 3 test.EU.event_id 1 Using where
QUESTION:
The query works fine but can be optimized. Specifically, Using temporary and Using filesort are costly and I would like to avoid them. I am not sure if this is possible, because I would like to ORDER BY the event date, and events have a 1:n relationship with the matching users. The ORDER BY applies to a joined table.
Any help or guidance would be greatly appreciated. Thank you and Happy Holidays!
Ordering can be done in two ways: by index or by temporary table. You are ordering by date in table Events1, but the query is using the PRIMARY KEY, which doesn't contain date, so in this case the result needs to be ordered in a temporary table.
It is not necessarily expensive though. If the result is small enough to fit in memory, it will not be a temporary table on disk, just in memory, and that is not expensive.
Neither is filesort. "Using filesort" doesn't mean it will use any file; it just means it's not sorting by index.
So, if your query executes fast you should be happy. If the result set is small it will be sorted in memory and no files will be created.
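If you want to verify whether a temporary table spilled to disk, check the session status counters before and after running the query (these are standard MySQL status variables):

SHOW SESSION STATUS LIKE 'Created_tmp%tables';
-- Created_tmp_tables counts all internal temporary tables;
-- Created_tmp_disk_tables counts only the ones that went to disk.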
I am new to MySQL and I have been pulling my hair out about this problem for days. I need to improve/optimize this query so that it runs faster - right now it's taking over 5 seconds.
Here is the query:
SELECT SQL_NO_CACHE COUNT(*) as multiple, a.*, b.*
FROM announcements as a
INNER JOIN stores as b
ON a.username=b.username
WHERE b.username is not null AND b.state='NC'
GROUP BY a.announcement_id
ORDER BY a.dt DESC LIMIT 0,10
Stores table consists of: store_id, username, name, state, city, zip, etc...
Announcements table consists of: announcement_id, msg, dt, username
The stores table has around 10,000 records and the announcements table has around 500,000 records.
What I'm trying to accomplish in English - display the 10 most recent store announcements, BUT what makes this complicated is that stores can have multiple entries in the stores table with the same username (one row per location). So if a chain store, let's say "Chipotle", sends an announcement, I want to display only one row for their announcement with a note next to it that says "this store has multiple locations". That's why I'm using the count(*) and GROUP BY: if count(*) > 1, I know there are multiple locations for the announcement.
The WHERE condition can be any state, city, or zip. I'm using SQL_NO_CACHE because announcements are updated frequently, so you rarely get the same results - does that make sense?
I would really appreciate any suggestions of how to do this better. I know little about indexes, but I did create an index for the "username" field in both tables. Feel free to shred me apart here, I know I must be missing something.
Update --
DESC stores;
Field Type Null Key Default Extra
store_id int(11) NO PRI NULL auto_increment
username varchar(20) NO MUL NULL
name varchar(100) NO NULL
street varchar(100) NO NULL
city varchar(50) NO NULL
state varchar(2) NO NULL
zip varchar(15) NO NULL
DESC announcements;
Field Type Null Key Default Extra
dt datetime NO NULL
username varchar(20) NO MUL NULL
msg varchar(200) NO NULL
announcement_id int(11) NO PRI NULL auto_increment
EXPLAIN output:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE a index username PRIMARY 47 NULL 315001 Using temporary; Using filesort
1 SIMPLE b ref username username 62 a.username 1 Using where
Try something like this (the per-username count now comes from the derived table, so no outer GROUP BY is needed):
SELECT SQL_NO_CACHE s.multiple, a.*
FROM announcements as a
INNER JOIN
(
    SELECT username, COUNT(username) as multiple FROM stores
    WHERE username IS NOT NULL AND state = 'NC'
    GROUP BY username
) as s
ON a.username=s.username
ORDER BY a.dt DESC LIMIT 10
If you are ordering on the dt column but there is no index on that column, then MySQL will have to do a (slow, expensive) sort of all of your result rows on that column every time you run the query.
Try adding an index on announcements.dt - MySQL may be able to access the rows in order by using the index and avoid the sorting step afterwards.
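For example (the index name is arbitrary):

CREATE INDEX idx_announcements_dt ON announcements (dt);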
Change the order of tables in your JOIN: MySQL reads rows from the first table and then finds matching rows in the second table. If you always filter your results by fields in the stores table, then stores should be the leading table in your JOIN, so it won't pick and sort unnecessary rows from the announcements table.
In the EXPLAIN output you pasted, it seems like only one shop matched the query; switching the order of tables would cause it to look for only that specific shop in the announcements table.
Add an index on the dt column (having an indexed integer column with unixtime would be even better)
If possible - create an integer userID for each username and JOIN using that column (add an index on that one as well); a sketch of this migration is below.
Not sure if MySQL still has problems with this but replacing COUNT(*) with COUNT(1) might be helpful.
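Regarding the integer userID suggestion, a hedged sketch of the migration (the usernames table and its column names are hypothetical):

CREATE TABLE usernames (
    user_id  INT AUTO_INCREMENT PRIMARY KEY,
    username VARCHAR(20) NOT NULL,
    UNIQUE KEY (username)
);
INSERT INTO usernames (username)
    SELECT DISTINCT username FROM stores;

ALTER TABLE announcements ADD COLUMN user_id INT, ADD INDEX (user_id);
UPDATE announcements a
    JOIN usernames u ON u.username = a.username
    SET a.user_id = u.user_id;
-- Repeat the ALTER/UPDATE for stores, then JOIN on the integer user_id
-- instead of the varchar username.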