MySQL: Optimize ORDER BY in Table Sort

I am developing an application for my college's website, and I would like to pull all the events from the database in ascending date order. There are four tables in total:
Table Events1
event_id, mediumint(8), Unsigned
date, date,
Index -> Primary Key (event_id)
Index -> (date)
Table event_users
event_id, smallint(5), Unsigned
user_id, mediumint(8), Unsigned
Index -> PRIMARY (event_id, user_id)
Table user_bm
link, varchar(26)
user_id, mediumint(8)
Index -> PRIMARY (link, user_id)
Table user_eoc
link, varchar(8)
user_id, mediumint(8)
Index -> Primary (link, user_id)
Query:
EXPLAIN SELECT * FROM events1 E INNER JOIN event_users EU ON E.event_id = EU.event_id
RIGHT JOIN user_eoc EOC ON EU.user_id = EOC.user_id
INNER JOIN user_bm BM ON EOC.user_id = BM.user_id
WHERE E.date >= '2013-01-01' AND E.date <= '2013-01-31'
AND EOC.link = "E690"
AND BM.link like "1.1%"
ORDER BY E.date
EXPLANATION:
The query above does two things:
1) It searches and filters out all students through the user_bm and user_eoc tables. The "link" columns are denormalized columns used to quickly filter students by major/year/campus etc.
2) After applying the filter, MySQL takes the user_ids of all matching students, finds all the events they are attending, and outputs them in ascending date order.
QUERY OPTIMIZER EXPLAIN:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE EOC ref PRIMARY PRIMARY 26 const 47 Using where; Using index; Using temporary; Using f...
1 SIMPLE BM ref PRIMARY,user_id-link user_id-link 3 test.EOC.user_id 1 Using where; Using index
1 SIMPLE EU ref PRIMARY,user_id user_id 3 test.EOC.user_id 1 Using index
1 SIMPLE E eq_ref PRIMARY,date-event_id PRIMARY 3 test.EU.event_id 1 Using where
QUESTION:
The query works fine but can be optimized. Specifically, Using filesort and Using temporary are costly and I would like to avoid them. I am not sure whether that is possible, because I would like to ORDER BY the date of events that have a 1:n relationship with the matching users; the ORDER BY applies to a joined table.
Any help or guidance would be greatly appreciated. Thank you and Happy Holidays!

Ordering can be done in two ways: by index or by temporary table. You are ordering by date in table Events1, but the join accesses it through the PRIMARY KEY, which doesn't contain date, so the result has to be ordered in a temporary table.
That is not necessarily expensive, though. If the result is small enough to fit in memory, it will not be a temporary table on disk, just in memory, and that is cheap.
Neither is filesort necessarily expensive. "Using filesort" doesn't mean any file is used; it just means the sort is not done by index.
So if your query executes fast, you should be happy. If the result set is small, it will be sorted in memory and no files will be created.
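If you want to verify that on your server, you can compare MySQL's temporary-table counters before and after running the query; the status and system variables below are standard MySQL ones:
-- how many temporary tables were created, and how many of them spilled to disk
SHOW GLOBAL STATUS LIKE 'Created_tmp%tables';
-- an in-memory temporary table may grow to the smaller of these two limits before going to disk
SHOW VARIABLES LIKE 'tmp_table_size';
SHOW VARIABLES LIKE 'max_heap_table_size';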

Related

Optimize a query that group results by a field from the joined table

I've got a very simple query that has to group the results by a field from the joined table:
SELECT SQL_NO_CACHE p.name, COUNT(1) FROM ycs_sales s
INNER JOIN ycs_products p ON s.id = p.sales_id
WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59'
GROUP BY p.name
Table ycs_products is really sales_products; it lists the products in each sale. I want to see the share of each product sold over a period of time.
The current query takes 2 seconds, which is too slow for user interaction. I need to make this query run fast. Is there a way to get rid of Using temporary without denormalization?
The join order is critically important: there is a lot of data in both tables, and limiting the number of records by date is an unquestionable prerequisite.
Here is the EXPLAIN result:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: s
type: range
possible_keys: PRIMARY,dtm
key: dtm
key_len: 6
ref: NULL
rows: 1164728
Extra: Using where; Using index; Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: p
type: ref
possible_keys: sales_id
key: sales_id
key_len: 5
ref: test.s.id
rows: 1
Extra:
2 rows in set (0.00 sec)
and the same in JSON:
EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "filesort": {
      "sort_key": "p.`name`",
      "temporary_table": {
        "table": {
          "table_name": "s",
          "access_type": "range",
          "possible_keys": ["PRIMARY", "dtm"],
          "key": "dtm",
          "key_length": "6",
          "used_key_parts": ["dtm"],
          "rows": 1164728,
          "filtered": 100,
          "attached_condition": "s.dtm between '2018-02-16 00:00:00' and '2018-02-22 23:59:59'",
          "using_index": true
        },
        "table": {
          "table_name": "p",
          "access_type": "ref",
          "possible_keys": ["sales_id"],
          "key": "sales_id",
          "key_length": "5",
          "used_key_parts": ["sales_id"],
          "ref": ["test.s.id"],
          "rows": 1,
          "filtered": 100
        }
      }
    }
  }
}
and the CREATE TABLE statements, though I find them unnecessary:
CREATE TABLE `ycs_sales` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`dtm` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `dtm` (`dtm`)
) ENGINE=InnoDB AUTO_INCREMENT=2332802 DEFAULT CHARSET=latin1
CREATE TABLE `ycs_products` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`sales_id` int(11) DEFAULT NULL,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `sales_id` (`sales_id`)
) ENGINE=InnoDB AUTO_INCREMENT=2332802 DEFAULT CHARSET=latin1
And PHP code to replicate the test environment:
// assumes an existing PDO connection to the test schema; the DSN and credentials are placeholders
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
#$pdo->query("set global innodb_flush_log_at_trx_commit = 2");
$pdo->query("create table ycs_sales (id int auto_increment primary key, dtm datetime)");
$stmt = $pdo->prepare("insert into ycs_sales values (null, ?)");
foreach (range(mktime(0,0,0,2,1,2018), mktime(0,0,0,2,28,2018)) as $stamp){
$stmt->execute([date("Y-m-d", $stamp)]);
}
$max_id = $pdo->lastInsertId();
$pdo->query("alter table ycs_sales add key(dtm)");
$pdo->query("create table ycs_products (id int auto_increment primary key, sales_id int, name varchar(255))");
$stmt = $pdo->prepare("insert into ycs_products values (null, ?, ?)");
$products = ['food', 'drink', 'vape'];
foreach (range(1, $max_id) as $id){
$stmt->execute([$id, $products[rand(0,2)]]);
}
$pdo->query("alter table ycs_products add key(sales_id)");
The problem is that grouping by name makes you lose the sales_id information, so MySQL is forced to use a temporary table.
Although it's not the cleanest of solutions, and one of my least favorite approaches, you could add a new index on both the name and the sales_id columns, like:
ALTER TABLE `yourdb`.`ycs_products`
ADD INDEX `name_sales_id_idx` (`name` ASC, `sales_id` ASC);
and force the query to use this index, with either FORCE INDEX or USE INDEX:
SELECT SQL_NO_CACHE p.name, COUNT(1) FROM ycs_sales s
INNER JOIN ycs_products p use index(name_sales_id_idx) ON s.id = p.sales_id
WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59'
GROUP BY p.name;
My execution reported only "Using where; Using index" on table p and "Using where" on table s.
Anyway, I strongly suggest you rethink your schema, because you can probably find a better design for these two tables. On the other hand, if this is not a critical part of your application, you can live with the "forced" index.
EDIT
Since it's quite clear that the problem is in the design, I suggest remodelling the relationship as a many-to-many. If you have a chance to verify it in your testing environment, here's what I would do:
1) Create a temporary table just to store the name and id of each product:
create temporary table tmp_prods
select min(id) id, name
from ycs_products
group by name;
2) Starting from the temporary table, create a replacement for ycs_products:
create table ycs_products_new
select * from tmp_prods;
ALTER TABLE `poc`.`ycs_products_new`
CHANGE COLUMN `id` `id` INT(11) NOT NULL ,
ADD PRIMARY KEY (`id`);
3) Create the join table:
CREATE TABLE `prod_sale` (
`prod_id` INT(11) NOT NULL,
`sale_id` INT(11) NOT NULL,
PRIMARY KEY (`prod_id`, `sale_id`),
INDEX `sale_fk_idx` (`sale_id` ASC),
CONSTRAINT `prod_fk`
FOREIGN KEY (`prod_id`)
REFERENCES ycs_products_new (`id`)
ON DELETE NO ACTION
ON UPDATE NO ACTION,
CONSTRAINT `sale_fk`
FOREIGN KEY (`sale_id`)
REFERENCES ycs_sales (`id`)
ON DELETE NO ACTION
ON UPDATE NO ACTION);
and fill it with the existing values:
insert into prod_sale (prod_id, sale_id)
select tmp_prods.id, sales_id from ycs_sales s
inner join ycs_products p
on p.sales_id=s.id
inner join tmp_prods on tmp_prods.name=p.name;
Finally, the join query:
select name, count(name) from ycs_products_new p
inner join prod_sale ps on ps.prod_id=p.id
inner join ycs_sales s on s.id=ps.sale_id
WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59'
group by p.id;
Please note that the GROUP BY is on the primary key, not on name.
Explain output:
explain select name, count(name) from ycs_products_new p inner join prod_sale ps on ps.prod_id=p.id inner join ycs_sales s on s.id=ps.sale_id WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59' group by p.id;
+------+-------------+-------+--------+---------------------+---------+---------+-----------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+--------+---------------------+---------+---------+-----------------+------+-------------+
| 1 | SIMPLE | p | index | PRIMARY | PRIMARY | 4 | NULL | 3 | |
| 1 | SIMPLE | ps | ref | PRIMARY,sale_fk_idx | PRIMARY | 4 | test.p.id | 1 | Using index |
| 1 | SIMPLE | s | eq_ref | PRIMARY,dtm | PRIMARY | 4 | test.ps.sale_id | 1 | Using where |
+------+-------------+-------+--------+---------------------+---------+---------+-----------------+------+-------------+
Why have an id for ycs_products? It seems like the sales_id should be the PRIMARY KEY of that table?
If that is possible, it eliminates the performance problem by getting rid of the issues brought up by senape.
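If you go that route, a hypothetical migration could look like this (it assumes every sale has exactly one row in ycs_products, which is what the test data implies):
-- drop the surrogate id and promote sales_id to PRIMARY KEY;
-- the old secondary index then becomes redundant
ALTER TABLE ycs_products
    DROP PRIMARY KEY,
    DROP COLUMN id,
    MODIFY sales_id INT(11) NOT NULL,
    ADD PRIMARY KEY (sales_id),
    DROP INDEX sales_id;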
If, instead, there are multiple rows for each sales_id, then changing the secondary index to this would help:
INDEX(sales_id, name)
Another thing to check is innodb_buffer_pool_size. It should be about 70% of available RAM; this improves the cacheability of data and indexes.
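As a quick check (the 70% figure is the usual rule of thumb for a dedicated database server):
-- current size in bytes
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
-- e.g. roughly 70% of 8 GB; resizable without a restart in MySQL 5.7.5+
-- SET GLOBAL innodb_buffer_pool_size = 6 * 1024 * 1024 * 1024;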
Are there really 1.1 million rows in that one week?
Summary table.
Build and maintain a table that summarizes all sales on a daily basis. It would have the name (denormalized) and date. Hence the table should be smaller than the original data.
The summary table would be something like
CREATE TABLE sales_summary (
dy DATE NOT NULL,
name varchar(255) NOT NULL,
daily_count SMALLINT UNSIGNED NOT NULL,
PRIMARY KEY(dy, name),
INDEX(name, dy) -- (You might need this for other queries)
) ENGINE=InnoDB;
The nightly update (run after midnight) would be a single query, something like the following. It may well take more than 2 seconds, but no user is waiting for it.
INSERT INTO sales_summary (dy, name, daily_count)
    SELECT DATE(s.dtm) AS dy,
           p.name,
           COUNT(*) AS daily_count
        FROM ycs_sales s
        JOIN ycs_products p ON s.id = p.sales_id
        WHERE s.dtm >= CURDATE() - INTERVAL 1 DAY
          AND s.dtm < CURDATE()
        GROUP BY 1, 2
    ON DUPLICATE KEY UPDATE
        daily_count = daily_count + VALUES(daily_count);
And the user's query will be something like:
SELECT SQL_NO_CACHE
name,
SUM(daily_count)
FROM sales_summary
WHERE dy >= '2018-02-16'
AND dy < '2018-02-16' + INTERVAL 7 DAY
GROUP BY name;
More discussion of Summary Tables: http://mysql.rjweb.org/doc.php/summarytables
Referring to your comment below, I assume filtering by column s.dtm is inevitable.
The join order is critically important, there is a lot of data in both tables and limiting the number of records by date is unquestionable prerequisite.
The most crucial action you can take is to observe the frequent search patterns.
For example, if your search criteria for dtm usually retrieve whole days' data, i.e. a few days (say fewer than 15), each between 00:00:00 and 23:59:59, you can use this information to offload the search-time overhead to insert time.
One method: add a new column that holds the truncated day, and index that new column. (MySQL has no functional indexes the way Oracle does; that is why we need to add a new column to imitate that functionality.) Something like:
alter table ycs_sales add dtm_truncated date;
delimiter //
create trigger dtm_truncater_insert
before insert on ycs_sales
for each row
set new.dtm_truncated = date(new.dtm);
//
create trigger dtm_truncater_update
before update on ycs_sales
for each row
set new.dtm_truncated = date(new.dtm);
//
delimiter ;
create index index_ycs_sales_dtm_truncated on ycs_sales(dtm_truncated) using hash;
# populate the new column for existing rows (the triggers only cover future writes);
# "where id > -1" bypasses safe update mode
update ycs_sales set dtm_truncated = date(dtm) where id > -1;
Then you can query the dtm_truncated field with IN. Of course this has its own tradeoffs: longer ranges will not work. But, as I mentioned above, the idea is to use the new column as a function output that indexes possible searches at insert/update time.
SELECT SQL_NO_CACHE p.name, COUNT(1) FROM ycs_sales s
INNER JOIN ycs_products p ON s.id = p.sales_id
WHERE s.dtm_truncated in ( '2018-02-16', '2018-02-17', '2018-02-18', '2018-02-19', '2018-02-20', '2018-02-21', '2018-02-22')
GROUP BY p.name
Additionally, make sure your key on dtm is a BTREE key. (If it were a hash key, InnoDB would have to go through all keys for a range scan.) The syntax to generate a BTREE index is:
create index index_ycs_sales_dtm on ycs_sales(dtm) using btree;
One final note:
Actually "partitioning pruning" (ref: here) is a concept to partition your data at insert time. But in MySql, I don't know why, partitioning requires related column to be in the primary key. I believe you don't want to add dtmcolumn into the primary key. But if you can do so, then you can also partition your data and get rid of the date range check overhead at the select time.
Not really providing an answer here, but I believe the core of the issue is nailing down where the real slowdown happens.
I'm not a MySQL expert, but I would try running the following queries:
SELECT SQL_NO_CACHE name, COUNT(*) FROM (
    SELECT p.name FROM ycs_sales s INNER JOIN ycs_products p ON s.id = p.sales_id
    WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59'
) AS t
GROUP BY name
SELECT SQL_NO_CACHE COUNT(*) FROM (
    SELECT name, COUNT(*) AS cnt FROM (
        SELECT p.name FROM ycs_sales s INNER JOIN ycs_products p ON s.id = p.sales_id
        WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59'
    ) AS t1
    GROUP BY name
) AS t2
SELECT SQL_NO_CACHE s.* FROM ycs_sales s
WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59'
SELECT SQL_NO_CACHE COUNT(*) FROM ycs_sales s
WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59'
When you do, can you tell us how long each one took?
I've run some test queries on the same data set. Here are my results:
Your query executes in 1.4 seconds.
After adding the covering index on ycs_products(sales_id, name) with
ALTER TABLE `ycs_products`
DROP INDEX `sales_id`,
ADD INDEX `sales_id_name` (`sales_id`, `name`)
the execution time drops to 1.0 second.
I still see "Using temporary; Using filesort" in the EXPLAIN result, but now there is also "Using index", which means no lookup into the clustered index is needed to get the values of the name column.
Note: I dropped the old index, since it would be redundant for most queries. But you might have some queries which need that index, with id (PK) coming right after sales_id.
You explicitly asked how to get rid of "Using temporary". But even if you find a way to force an execution plan that avoids the filesort, you won't win much.
Consider the following query:
SELECT SQL_NO_CACHE COUNT(1) FROM ycs_sales s
INNER JOIN ycs_products p ON s.id = p.sales_id
WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59'
This one needs 0.855 seconds.
Since there is no GROUP BY clause, no filesort is performed.
It doesn't return the result that you want, but the point is: this is the bottom limit of what you can get without storing and maintaining redundant data.
If you want to know where the engine spends most of its time, remove the JOIN:
SELECT SQL_NO_CACHE COUNT(1) FROM ycs_sales s
WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59'
It executes in 0.155 seconds. So we can conclude: The JOIN is the most expensive part of the query. And you cannot avoid it.
The full list of execution times:
0.155 sec (11%) to read and count 604K rows
0.690 sec (49%) for the JOIN (which you cannot avoid)
0.385 sec (28%) for second lookup (which can be removed with an index)
0.170 sec (12%) for GROUP BY with filesort (which you try to avoid)
So again: "Using temporary; Using filesort" looks bad in the EXPLAIN result - But it's not your biggest problem.
Test environment:
Windows 10 + MariaDB 10.3.13 with innodb_buffer_pool_size = 1G
Test data has been generated with the following script (takes about 1 to 2 minutes on an HDD):
drop table if exists ids;
create table ids(id mediumint unsigned auto_increment primary key);
insert into ids(id)
select null as id
from information_schema.COLUMNS c1
, information_schema.COLUMNS c2
, information_schema.COLUMNS c3
limit 2332801; -- 60*60*24*27 + 1
drop table if exists ycs_sales;
CREATE TABLE `ycs_sales` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`dtm` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `dtm` (`dtm`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
insert into ycs_sales(id, dtm) select id, date('2018-02-01' + interval (id-1) second) from ids;
drop table if exists ycs_products;
CREATE TABLE `ycs_products` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`sales_id` int(11) DEFAULT NULL,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `sales_id` (`sales_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
insert into ycs_products(id, sales_id, name)
select id
, id as sales_id
, case floor(rand(1)*3)
when 0 then 'food'
when 1 then 'drink'
when 2 then 'vape'
end as name
from ids;
I have had similar problems several times. Usually, I'd expect the best results with:
CREATE INDEX s_date ON ycs_sales(dtm, id);
-- Add a covering index
CREATE INDEX p_name ON ycs_products(sales_id, name);
This ought to get rid of the "the tables are very large" problem, since all the information required is now contained in the two indexes. Actually, the first index does not need id: since id is the primary key, InnoDB implicitly appends it to every secondary index.
If this is still not enough because the two tables are too large, then you have no choice: you must avoid the JOIN. It is already going as fast as it can, and if that's not enough, it has to go.
I believe you can do this with a couple of TRIGGERs to maintain an ancillary daily sales report table (if you never have returned products, a single trigger on INSERT into sales will suffice). Try to go with just (product_id, sales_date, sales_count) and JOIN that with the product table to get the name on output; if that is not enough, use (product_id, product_name, sales_date, sales_count) and periodically update product_name to keep names synced with the primary table. Since (product_id, sales_date) is now unique and you run searches on sales_date, you can make that pair the primary key and partition the ancillary table by sales year.
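A minimal sketch of the trigger idea, adapted to the schema in this question (there is no separate product table here, so I key the summary on the denormalized name; all object names are hypothetical):
CREATE TABLE daily_sales (
    name VARCHAR(255) NOT NULL,
    sales_date DATE NOT NULL,
    sales_count INT UNSIGNED NOT NULL,
    PRIMARY KEY (name, sales_date)
);
DELIMITER //
CREATE TRIGGER daily_sales_ins AFTER INSERT ON ycs_products
FOR EACH ROW
BEGIN
    -- one ycs_products row per sold item, so bump that day's counter by one
    INSERT INTO daily_sales (name, sales_date, sales_count)
    SELECT NEW.name, DATE(s.dtm), 1
      FROM ycs_sales s
     WHERE s.id = NEW.sales_id
    ON DUPLICATE KEY UPDATE sales_count = sales_count + 1;
END//
DELIMITER ;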
(Once or twice, when partitioning was not possible but I was confident that I would only very rarely cross the "ideal" partition boundary, I partitioned manually - i.e. sales_2012, sales_2013, sales_2014 - and built programmatically a UNION of the two or three years involved, followed by a regroup, resort and secondary totalization stage. Crazy as a March hare, yes, but it worked).

Slow Inner JOIN of 6 tables

Sorry, my SQL knowledge is amateur.
SQL Fiddle: http://sqlfiddle.com/#!2/5640d/1
Please click the link above to refer to the database structure and query.
I have 6 tables. Each record takes only one row in each table, and all 6 tables share the same three columns: Custgroup, RandomNumber and user_id.
Custgroup is a group name; within the group, each record has a unique RandomNumber.
The query is pretty slow on the first run (anywhere from several seconds to a few minutes, randomly). After that it is fast, but only for the first few pages: if I go to page 20 or 30+, it loads endlessly (it just took about 5 minutes). And there is not much data, only 5000 rows, so this will be big trouble in the future. I also haven't added any WHERE clause yet, as I need filtering on every column on my website (not my idea; requested by my boss).
I tried changing it to LEFT JOIN, JOIN and every other way I could find, but the loading is still slow.
I added an INDEX on user_id, Custgroup and RandomNumber in all tables.
Is there any way to solve this problem? I was never good at using JOINs, and they are really slow on my database. Or please let me know if my table structure is really bad; I'm willing to redo it.
Thanks.
Edited: EXPLAIN output:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE tE ALL NULL NULL NULL NULL 5685
1 SIMPLE tA ALL NULL NULL NULL NULL 6072 Using join buffer
1 SIMPLE t1 ref user_id,Custgroup,RandomNumber RandomNumber 23 func 1 Using where
1 SIMPLE tB ALL NULL NULL NULL NULL 5868 Using where; Using join buffer
1 SIMPLE tC ALL NULL NULL NULL NULL 6043 Using where; Using join buffer
1 SIMPLE tD ALL NULL NULL NULL NULL 5906 Using where; Using join buffer
Keyname Type Unique Packed Column Cardinality Collation Null Comment
PRIMARY BTREE Yes No ID 6033 A
RandomNumber BTREE No No RandomNumber 6033 A
Custgroup BTREE No No Custgroup 1 A
user_id BTREE No No user_id 1 A
Edited: EXPLAIN EXTENDED .....
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE tE ALL NULL NULL NULL NULL 6084 100.00
1 SIMPLE t1 ref user_id,Custgroup,RandomNumber RandomNumber 23 func 1 100.00 Using where
1 SIMPLE tB ALL NULL NULL NULL NULL 5664 100.00 Using where; Using join buffer
1 SIMPLE tC ALL NULL NULL NULL NULL 5976 100.00 Using where; Using join buffer
1 SIMPLE tA ALL NULL NULL NULL NULL 6065 100.00 Using where; Using join buffer
1 SIMPLE tD ALL NULL NULL NULL NULL 6286 100.00 Using where; Using join buffer
The logical indexing for such a structure would have to be
CREATE INDEX UserAddedRecord1_ndx ON UserAddedRecord1 (user_id, Custgroup, RandomNumber);
CREATE INDEX UserAddedRecord1_A_ndx ON UserAddedRecord1_A (Custgroup, RandomNumber);
CREATE INDEX UserAddedRecord1_B_ndx ON UserAddedRecord1_B (Custgroup, RandomNumber);
CREATE INDEX UserAddedRecord1_C_ndx ON UserAddedRecord1_C (Custgroup, RandomNumber);
CREATE INDEX UserAddedRecord1_D_ndx ON UserAddedRecord1_D (Custgroup, RandomNumber);
CREATE INDEX UserAddedRecord1_E_ndx ON UserAddedRecord1_E (Custgroup, RandomNumber);
And if you are going to add WHERE clauses, the filtered columns ought to go in the relevant index before the JOIN columns (provided you run an equality or IN search, e.g. City = "New York"). For example, if City is in UserAddedRecord1_B, then UserAddedRecord1_B_ndx ought to be (City, Custgroup, RandomNumber), as sketched below.
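For instance (hypothetical, assuming City lives in UserAddedRecord1_B and is always filtered with equality):
CREATE INDEX UserAddedRecord1_B_ndx ON UserAddedRecord1_B (City, Custgroup, RandomNumber);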
But at this point, I have to ask: why? Apparently the rows in all these tables always belong to the same user. For example:
t1.Cell,t1.Name,t1.Gender,t1.Birthday
tA.Email,tA.State,tA.Address,tA.City,tA.Postcode
...it is obvious that you can't have two different users here (and having Email in the same block as Postcode tells me this was not really intended as a one-to-many relation).
tB.Website,tB.Description,
tC.Model,tC.Capital,tC.Registry,tC.NoEmployees,
tD.SetUpDate,tD.PeopleInCharge,tD.Certification,tD.AddOEM,
tD.NoResearcher,tD.RoomSize,tD.RegisterMessage,
tE.WebsiteName,tE.OriginalWebsite,tE.QQ,tE.MSN,tE.Skype
These are all portions of a single large "user information form", divided in (optional?) sections.
I surmise that this structure arose from some kind of legacy/framework system that mapped each form-submission section to a table, so that someone may have an entry in tables B, C and E, and someone else in tables A, C and D.
If this is true, and if user_id is the same for all tables, then one way of having this go faster is to explicitly add a condition on user_id for each table, and suitably modify indexes and JOINs:
CREATE INDEX UserAddedRecord1_ndx ON UserAddedRecord1 (user_id, Custgroup, RandomNumber);
CREATE INDEX UserAddedRecord1_A_ndx ON UserAddedRecord1_A (user_id, Custgroup, RandomNumber);
CREATE INDEX UserAddedRecord1_B_ndx ON UserAddedRecord1_B (user_id, Custgroup, RandomNumber);
CREATE INDEX UserAddedRecord1_C_ndx ON UserAddedRecord1_C (user_id, Custgroup, RandomNumber);
CREATE INDEX UserAddedRecord1_D_ndx ON UserAddedRecord1_D (user_id, Custgroup, RandomNumber);
CREATE INDEX UserAddedRecord1_E_ndx ON UserAddedRecord1_E (user_id, Custgroup, RandomNumber);
... FROM UserAddedRecord1 t1
JOIN UserAddedRecord1_A tA USING (user_id, CustGroup, RandomNumber)
JOIN UserAddedRecord1_B tB USING (user_id, CustGroup, RandomNumber)
JOIN UserAddedRecord1_C tC USING (user_id, CustGroup, RandomNumber)
JOIN UserAddedRecord1_D tD USING (user_id, CustGroup, RandomNumber)
JOIN UserAddedRecord1_E tE USING (user_id, CustGroup, RandomNumber)
WHERE t1.user_id = '1'
Try fiddle
The thing to do would be to incorporate all the tables into one table with all the fields in one row, and then, maybe for legacy purposes, create VIEWs that look like tables 1, A, B, C, D and E, each a "vertical" partition of the tuple. The big SELECT would then run on the complete table holding all the fields (and you would save on the duplicated columns, too).
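A sketch of that consolidation; the column subsets and types are hypothetical, taken from the fragments quoted above:
CREATE TABLE UserAddedRecord (
    user_id INT NOT NULL,
    Custgroup VARCHAR(50) NOT NULL,
    RandomNumber VARCHAR(23) NOT NULL,
    Cell VARCHAR(20), Name VARCHAR(100), Gender CHAR(1), Birthday DATE,
    Email VARCHAR(100), State VARCHAR(50), Address VARCHAR(200),
    City VARCHAR(50), Postcode VARCHAR(15),
    -- ... remaining columns from the B, C, D and E sections ...
    PRIMARY KEY (user_id, Custgroup, RandomNumber)
);
-- legacy view so existing code can keep selecting from "table A"
CREATE VIEW UserAddedRecord1_A AS
SELECT user_id, Custgroup, RandomNumber, Email, State, Address, City, Postcode
  FROM UserAddedRecord;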

no more optimization for mysql table?

I think I've optimized what I could for the following table structure:
CREATE TABLE `sal_forwarding` (
`sid` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
`f_shop` INT(11) NOT NULL,
`f_offer` INT(11) DEFAULT NULL,
...
`f_affiliateId` TINYINT(3) UNSIGNED NOT NULL,
`forwardDate` DATE NOT NULL,
PRIMARY KEY (`sid`),
KEY `f_partner` (`f_partner`,`forwardDate`),
KEY `forwardDate` (`forwardDate`,`cid`),
KEY `forwardDate_2` (`forwardDate`,`f_shop`),
KEY `forwardDate_3` (`forwardDate`,`f_shop`,`f_partner`),
KEY `forwardDate_4` (`forwardDate`,`f_partner`,`cid`),
KEY `forwardDate_5` (`forwardDate`,`f_affiliateId`),
KEY `forwardDate_6` (`forwardDate`,`f_shop`,`sid`),
KEY `forwardDate_7` (`forwardDate`,`f_shop`,`cid`),
KEY `forwardDate_8` (`forwardDate`,`f_affiliateId`,`cid`)
) ENGINE=INNODB AUTO_INCREMENT=10946560 DEFAULT CHARSET=latin1
This is the EXPLAIN output:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE sal_forwarding range forwardDate,forwardDate_2,forwardDate_3,forwardDate_4,forwardDate_5,forwardDate_6,forwardDate_7,forwardDate_8 forwardDate_7 3 (NULL) 1221784 Using where; Using index; Using filesort
The following query needs 23 seconds to read 2300 rows:
SELECT COUNT(sid),f_shop, COUNT(DISTINCT(cid))
FROM sal_forwarding
WHERE forwardDate BETWEEN "2011-01-01" AND "2011-11-01"
GROUP BY f_shop
What can I do to improve the performance?
Thank you very much.
A slight modification to what you had: use COUNT(*) instead of an actual field, and for the DISTINCT you don't need parentheses around the column. The optimizer may also be getting confused by all the indexes you have; remove all the other indexes on forwardDate, keeping just one based on (forwardDate, f_shop, cid) (your current forwardDate_7 index):
SELECT
COUNT(*),
f_shop,
COUNT(DISTINCT cid )
FROM
sal_forwarding
WHERE
forwardDate BETWEEN "2011-01-01" AND "2011-11-01"
GROUP BY
f_shop
Then, for grins, and since nothing else appears to be working for you, try a pre-aggregating subquery and SUM from that, so the outer query does not rely on any other index pages across your nearly 11 million records (implied by the AUTO_INCREMENT value):
SELECT
f_shop,
sum( PreQuery.Presum) totalCnt,
COUNT(*) dist_cid
FROM
( select f_shop, cid, count(*) presum
from sal_forwarding
WHERE forwardDate BETWEEN "2011-01-01" AND "2011-11-01"
group by f_shop, cid ) PreQuery
GROUP BY
f_shop
Since the inner pre-query does a simple count of records grouped by f_shop and cid (optimizable by the index), you now have your distinct already rolled up via a simple count; then SUM() the inner count's presum column. Again, just another option to try to turn the tables; hope it works for you.
I don't think the (forwardDate, f_shop, cid) index is good for this query - not any better than a simple (forwardDate) index, because of the range condition on the forwardDate column.
You may try a (f_shop, cid, forwardDate) index instead.
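For example (the index name is arbitrary):
ALTER TABLE sal_forwarding
    ADD KEY f_shop_cid_date (f_shop, cid, forwardDate);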

Slow select query with left join is null and limit results

I have the following query that is being logged as a slow query:
EXPLAIN EXTENDED SELECT *
FROM (
`photo_data`
)
LEFT JOIN `deleted_photos` ON `deleted_photos`.`photo_id` = `photo_data`.`photo_id`
WHERE `deleted_photos`.`photo_id` IS NULL
ORDER BY `upload_date` DESC
LIMIT 50
Here's the output of explain:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE photo_data index NULL upload_date 8 NULL 142523
1 SIMPLE deleted_photos eq_ref photo_id photo_id 767 tbc.photo_data.photo_id 1 Using where; Not exists
I can see that it's having to go through all 142K records to pull the latest 50 out of the database.
I have these two indexes:
UNIQUE KEY `photo_id` (`photo_id`),
KEY `upload_date` (`upload_date`)
I was hoping that the index on upload_date would help limit the number of rows. Any thoughts on what I can do to speed this up?
You could add a field to your photo_data table that shows whether or not a photo is deleted, instead of having to find that out by joining to another table. Then, if you add an index on (deleted, upload_date), your query should be very fast; see the sketch below.
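A hypothetical migration along those lines (the deleted column and the index name are my own):
-- track deletion on photo_data itself
ALTER TABLE photo_data
    ADD COLUMN deleted TINYINT(1) NOT NULL DEFAULT 0,
    ADD INDEX deleted_upload_date (deleted, upload_date);
-- backfill from the existing deleted_photos table
UPDATE photo_data p
JOIN deleted_photos d ON d.photo_id = p.photo_id
SET p.deleted = 1;
-- the 50 newest live photos can then be read straight off the index
SELECT * FROM photo_data
WHERE deleted = 0
ORDER BY upload_date DESC
LIMIT 50;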

mysql optimization - display 10 most recent records, but also identify duplicate rows

I am new to MySQL and I have been pulling my hair out over this problem for days. I need to improve/optimize this query so that it runs faster; right now it's taking over 5 seconds.
Here is the query:
SELECT SQL_NO_CACHE COUNT(*) as multiple, a.*,b.*
FROM announcements as a
INNER JOIN stores as s
ON a.username=s.username
WHERE s.username is not null AND s.state='NC'
GROUP BY a.announcement_id
ORDER BY a.dt DESC LIMIT 0,10
Stores table consists of: store_id, username, name, state, city, zip, etc...
Announcements table consists of: announcement_id, msg, dt, username
The stores table has around 10,000 records and the announcements table has around 500,000 records.
What I'm trying to accomplish in English: display the 10 most recent store announcements. What makes this complicated is that stores can have multiple entries in the stores table with the same username (one row per location). So if a chain store, let's say "Chipotle", sends an announcement, I want to display only one row for their announcement with a note next to it that says "this store has multiple locations". That's why I'm using the COUNT(*) and GROUP BY: if COUNT(*) > 1, I know there are multiple locations for the announcement.
The WHERE condition can be any state, city, or zip. I'm using SQL_NO_CACHE because announcements are updated frequently, so you rarely get the same results; does that make sense?
I would really appreciate any suggestions on how to do this better. I know little about indexes, but I did create an index on the username field in both tables. Feel free to shred me apart here; I know I must be missing something.
Update --
DESC stores;
Field Type Null Key Default Extra
store_id int(11) NO PRI NULL auto_increment
username varchar(20) NO MUL NULL
name varchar(100) NO NULL
street varchar(100) NO NULL
city varchar(50) NO NULL
state varchar(2) NO NULL
zip varchar(15) NO NULL
DESC announcements;
Field Type Null Key Default Extra
dt datetime NO NULL
username varchar(20) NO MUL NULL
msg varchar(200) NO NULL
announcement_id int(11) NO PRI NULL auto_increment
EXPLAIN output;
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE a index username PRIMARY 47 NULL 315001 Using temporary; Using filesort
1 SIMPLE b ref username username 62 a.username 1 Using where
Try something like this (the per-store count is computed in the subquery, so the outer query no longer needs its own aggregate):
SELECT SQL_NO_CACHE s.multiple, a.*
FROM announcements as a
INNER JOIN
(
    SELECT username, COUNT(username) as multiple FROM stores
    WHERE username IS NOT NULL AND state = 'NC'
    GROUP BY username
) as s
ON a.username = s.username
ORDER BY a.dt DESC LIMIT 10
If you are ordering on the dt column but there is no index on that column, then MySQL has to do a (slow, expensive) sort of all your result rows on that column every time you run the query.
Try adding an index on announcements.dt; MySQL may be able to read the rows in order through the index and avoid the sorting step afterwards.
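A minimal sketch (assuming the announcements table from the question; the index name is arbitrary):
ALTER TABLE announcements ADD INDEX dt (dt);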
Change the order of tables in your JOIN. MySQL reads rows from the first table and then finds matching rows in the second table. If you always filter your results by fields in the stores table, then stores should be the leading table in your JOIN, so the engine won't pick and sort unnecessary rows from the announcements table.
In the EXPLAIN output you pasted, it seems like only one shop matched the query; switching the order of the tables would make it look only for that specific shop in the announcements table.
Add an index on the dt column (having an indexed integer column with unixtime would be even better)
If possible, create an integer user_id for each username and JOIN using that column (add an index on that one as well).
Not sure if MySQL still has problems with this but replacing COUNT(*) with COUNT(1) might be helpful.