Is there a more efficient way to write this query? - mysql

Ok imagine the following DB structure
USERS:
id | name | company_id
1  | John | 1
2  | Jane | 1
3  | Jack | 2
4  | Jill | 3
COMPANIES:
id | name
1  | CompanyA
2  | CompanyB
3  | CompanyC
4  | CompanyD
First I want to SELECT all the companies that have more than one user
SELECT
`c`.`name`
FROM `companies` AS `c`
LEFT JOIN `users` AS `u` ON `c`.`id` = `u`.`company_id`
GROUP BY `c`.`id`
HAVING COUNT(`u`.`id`) > 1
Easy enough. Now I want to SELECT all the users that belong to a company that has more than one user. I have this combined query, but I think it is not efficient:
SELECT * FROM `users` WHERE `company_id` = (
SELECT
`c`.`id`
FROM `companies` AS `c`
LEFT JOIN `users` AS `u` ON `c`.`id` = `u`.`company_id`
GROUP BY `c`.`id`
HAVING COUNT(`u`.`id`) > 1
)
Basically I take the id returned from the first query (companies that have more than 1 user) and then query the users table to find all users with that company.

Why not
SELECT * FROM users u GROUP BY u.company_id HAVING COUNT(u.id) > 1
You don't really need any information from the companies table for the data you say needs returning: "Now I want to SELECT all the users that belong to a company that has more than one user." (Note, though, that this GROUP BY collapses the result to one row per company_id and is rejected under ONLY_FULL_GROUP_BY, so it does not actually list every user; the subquery-based answers below do.)

try this:
SELECT u.id,u.name,u.company_id FROM users u
inner join companies c on u.company_id = c.id
group by c.id
having count(u.id) > 1

The simplest way to get the users only is probably to keep the subquery but eliminate the join; since it's not a correlated subquery, it should be fairly efficient (obviously an index on company_id helps here):
SELECT u.* FROM USERS u WHERE company_id IN (
SELECT company_id FROM USERS GROUP BY company_id HAVING COUNT(*)>1
);
You could, for example, rewrite it as a LEFT JOIN, but I suspect it will actually be less efficient, since you'd most likely need a DISTINCT when using a JOIN:
SELECT DISTINCT u.*
FROM USERS u
LEFT JOIN USERS u2
ON u.company_id=u2.company_id AND u.id<>u2.id
WHERE u2.id IS NOT NULL;
An SQLfiddle to test both.

Try also a semi-join query:
SELECT *
FROM users u
WHERE EXISTS (
SELECT null FROM users u1
WHERE u.company_id=u1.company_id
AND u.id <> u1.id
)
demo --> http://www.sqlfiddle.com/#!2/12dc34/2
Assuming that id is a primary key column, creating an index on the company_id column gives better performance.
If you are really obsessed with the performance of this query, create a composite index on the columns company_id + id:
CREATE INDEX very_fast ON users( company_id, id );

Could you try this?
SELECT users.*
FROM users INNER JOIN
(
SELECT company_id
FROM users
GROUP BY company_id
HAVING COUNT(*) > 1
) x USING(company_id);
You should have an index on company_id: INDEX(company_id)
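If that index does not exist yet, it could be added along these lines (just a sketch; the index name idx_company_id is made up):
-- secondary index so the derived table's GROUP BY company_id can be resolved from the index
ALTER TABLE users ADD INDEX idx_company_id (company_id);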
Performance Test
I have tested the 3 queries from the answers above.
Q1 = sub-query (with GROUP BY) and INNER JOIN
Q2 = LEFT JOIN and IS NOT NULL
Q3 = EXISTS
All queries return the same result. The test was done with the TPC-H lineitem table, and the problem was "find line items whose order has more than one line item".
Test Results
It depends on whether you want to retrieve the FIRST N rows or all rows.
Q1 (get FIRST 10K rows) : 2.85 sec
Q2 (get FIRST 10K rows) : 0.03 sec
Q3 (get FIRST 10K rows) : 0.03 sec
Q1 (get all rows) : 8.19 sec
Q2 (get all rows) : 34.12 sec
Q3 (get all rows) : 29.54 sec
Schema and DATA
mysql> SELECT SQL_NO_CACHE COUNT(*) FROM lineitem\G
*************************** 1. row ***************************
COUNT(*): 11997996
1 row in set (1.68 sec)
mysql> SHOW CREATE TABLE lineitem\G
*************************** 1. row ***************************
Table: lineitem
Create Table: CREATE TABLE `lineitem` (
`l_orderkey` int(11) NOT NULL,
`l_partkey` int(11) NOT NULL,
`l_suppkey` int(11) NOT NULL,
`l_linenumber` int(11) NOT NULL,
`l_quantity` decimal(15,2) NOT NULL,
`l_extendedprice` decimal(15,2) NOT NULL,
`l_discount` decimal(15,2) NOT NULL,
`l_tax` decimal(15,2) NOT NULL,
`l_returnflag` char(1) NOT NULL,
`l_linestatus` char(1) NOT NULL,
`l_shipDATE` date NOT NULL,
`l_commitDATE` date NOT NULL,
`l_receiptDATE` date NOT NULL,
`l_shipinstruct` char(25) NOT NULL,
`l_shipmode` char(10) NOT NULL,
`l_comment` varchar(44) NOT NULL,
PRIMARY KEY (`l_orderkey`,`l_linenumber`),
KEY `l_orderkey` (`l_orderkey`),
KEY `l_partkey` (`l_partkey`,`l_suppkey`),
CONSTRAINT `lineitem_ibfk_1` FOREIGN KEY (`l_orderkey`) REFERENCES `orders` (`o_orderkey`),
CONSTRAINT `lineitem_ibfk_2` FOREIGN KEY (`l_partkey`, `l_suppkey`) REFERENCES `partsupp` (`ps_partkey`, `ps_suppkey`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
Queries
Q1 FIRST 10K
SELECT SQL_NO_CACHE DISTINCT u.l_orderkey, u.l_linenumber
FROM lineitem u INNER JOIN
(
SELECT l_orderkey
FROM lineitem
GROUP BY l_orderkey
HAVING COUNT(*) > 1
) x USING (l_orderkey)
LIMIT 10000;
Q2 FIRST 10K
SELECT SQL_NO_CACHE DISTINCT u.l_orderkey, u.l_linenumber
FROM lineitem u
LEFT JOIN lineitem u2
ON u.l_orderkey=u2.l_orderkey AND u.l_linenumber<>u2.l_linenumber
WHERE u2.l_linenumber IS NOT NULL
LIMIT 10000;
Q3 FIRST 10K
SELECT SQL_NO_CACHE DISTINCT u.l_orderkey, u.l_linenumber
FROM lineitem u
WHERE EXISTS (
SELECT null FROM lineitem u1
WHERE u.l_orderkey=u1.l_orderkey
AND u.l_linenumber <> u1.l_linenumber
)
LIMIT 10000;
retrieve entire rows
Q1 ALL
SELECT SQL_NO_CACHE COUNT(*)
FROM lineitem u INNER JOIN
(
SELECT l_orderkey
FROM lineitem
GROUP BY l_orderkey
HAVING COUNT(*) > 1
) x USING (l_orderkey);
Q2 ALL
SELECT SQL_NO_CACHE COUNT(*)
FROM lineitem u
LEFT JOIN lineitem u2
ON u.l_orderkey=u2.l_orderkey AND u.l_linenumber<>u2.l_linenumber
WHERE u2.l_linenumber IS NOT NULL;
Q3 ALL
SELECT SQL_NO_CACHE COUNT(*)
FROM lineitem u
WHERE EXISTS (
SELECT null FROM lineitem u1
WHERE u.l_orderkey=u1.l_orderkey
AND u.l_linenumber <> u1.l_linenumber
);

Related

I want to optimize my SQL query because of long response time

First I used whereHas, but then I decided to use this approach. It performs better than whereHas, but it still doesn't satisfy me: the query response time is 873 ms, and I have 400k+ rows in the table.
select count(*) as aggregate
from `orders`
where (`pickup_address_id` in (
select `id`
from `addresses`
where `region_id` = 12)
or `delivery_address_id` in (
select `id`
from `addresses`
where `region_id` = 12)
) and `orders`.`status` = 2
Try this:
select count(distinct o.`id`) as aggregate
from `orders` o
inner join `addresses` a ON a.`id` IN (o.`pickup_address_id`, o.`delivery_address_id`)
AND a.`region_id` = 12
where o.`status` = 2
Alternatively:
SELECT count(distinct id) as aggregate
FROM (
select o.`id`
from `orders` o
inner join `addresses` a ON a.`id` = o.`pickup_address_id`
AND a.`region_id` = 12
where o.`status` = 2
UNION
select o.`id`
from `orders` o
inner join `addresses` a ON a.`id` = o.`delivery_address_id`
AND a.`region_id` = 12
where o.`status` = 2
) t
But I don't know how much you can improve on scanning 400K rows in under a second.
First, you can try to eliminate evaluating the same subquery multiple times (twice, to be precise) by using a Common Table Expression:
WITH CTE(id) AS (
SELECT id
FROM addresses
WHERE region_id = 12
)
This CTE is evaluated once.
Second, get the row count from the orders table joined with the CTE on pickup_address_id or delivery_address_id matching an id in the CTE:
WITH CTE(id) AS (
SELECT id
FROM addresses
WHERE region_id = 12
)
SELECT COUNT(*)
FROM orders
CROSS JOIN CTE ON CTE.id = orders.delivery_address_id
OR CTE.id = orders.pickup_address_id
Finally, add the filter on status = 2, and the query looks like this (note: if an order's pickup and delivery addresses both match region 12, this join counts the order twice; use COUNT(DISTINCT orders.id) if you need exactly the original count):
WITH CTE(id) AS (
SELECT id
FROM addresses
WHERE region_id = 12
)
SELECT COUNT(*)
FROM orders
CROSS JOIN CTE ON CTE.id = orders.delivery_address_id
OR CTE.id = orders.pickup_address_id
WHERE orders.status = 2
Also you should have the following indexes (a DDL sketch for adding them follows the list):
addresses table:
INDEX (region_id)
orders table:
INDEX (pickup_address_id),
INDEX (delivery_address_id),
INDEX (status)
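For existing tables, those indexes could be added with statements along these lines (a sketch; the index names are made up):
-- addresses: support the region_id = 12 filter
ALTER TABLE addresses ADD INDEX idx_region_id (region_id);
-- orders: support the join columns and the status filter
ALTER TABLE orders
  ADD INDEX idx_pickup_address_id (pickup_address_id),
  ADD INDEX idx_delivery_address_id (delivery_address_id),
  ADD INDEX idx_status (status);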
Give it a try.
With empty tables I've got this
Schema (MySQL v8.0)
create table addresses (
id int primary key,
region_id int not null,
index(region_id)
);
create table orders (
id int primary key,
pickup_address_id int,
delivery_address_id int,
status int not null,
index (pickup_address_id),
index (delivery_address_id),
index(status),
foreign key (pickup_address_id) references addresses(id),
foreign key (delivery_address_id) references addresses(id)
);
Query #1
explain with cte(id) as (
select id from addresses where region_id = 12)
select count(*) from orders
cross join cte on cte.id = orders.delivery_address_id
or cte.id = orders.pickup_address_id
where status = 2;
id | select_type | table     | partitions | type | possible_keys                                | key       | key_len | ref   | rows | filtered | Extra
1  | SIMPLE      | orders    |            | ref  | pickup_address_id,delivery_address_id,status | status    | 4       | const | 1    | 100      |
1  | SIMPLE      | addresses |            | ref  | PRIMARY,region_id                            | region_id | 4       | const | 1    | 100      | Using where; Using index

Optimizing MySQL query removing subquery

Having these tables:
customers
---------------------
`id` smallint(5) unsigned NOT NULL auto_increment,
`name` varchar(100) collate utf8_unicode_ci NOT NULL,
....
customers_subaccounts
-------------------------
`companies_id` mediumint(8) unsigned NOT NULL,
`customers_id` mediumint(8) unsigned NOT NULL,
`subaccount` int(10) unsigned NOT NULL
I need to get all the customers who have been assigned more than one subaccount for the same company.
This is what I've got:
SELECT * FROM customers
WHERE id IN
(SELECT customers_id
FROM customers_subaccounts
GROUP BY customers_id, companies_id
HAVING COUNT(subaccount) > 1)
This query is too slow, though. It's even slower if I add the DISTINCT modifier to customers_id in the SELECT of the subquery, which in the end retrieves the same customer list for the whole query. Maybe there's a better way without a subquery; anything faster will help, and I'm not sure whether it retrieves an accurate, correct list.
Any help?
You can replace the subquery with an INNER JOIN:
SELECT t1.id
FROM customers t1
INNER JOIN
(
SELECT DISTINCT customers_id
FROM customers_subaccounts
GROUP BY customers_id, companies_id
HAVING COUNT(*) > 1
) t2
ON t1.id = t2.customers_id
You can also try using EXISTS(), which may be faster than a join:
SELECT * FROM customers t
WHERE EXISTS(SELECT 1 FROM customers_subaccounts s
WHERE s.customers_id = t.id
GROUP BY s.customers_id, s.companies_id
HAVING COUNT(subaccount) > 1)
You should also consider adding the following indexes, if they don't exist yet (see the sketch after this list):
customers_subaccounts (customers_id,companies_id,subaccount)
customers (id)
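For example (a sketch; the index name is made up, and customers(id) is normally already covered by the primary key):
-- composite index matching the GROUP BY customers_id, companies_id in the subquery
CREATE INDEX idx_subaccounts_cust_comp_sub
  ON customers_subaccounts (customers_id, companies_id, subaccount);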
Assuming that you want different subaccounts for the company (or that they are guaranteed to be different anyway), the following could be faster under some circumstances:
select c.*
from (select distinct cs.customers_id
from customers_subaccounts cs join
customers_subaccounts cs2
on cs.customers_id = cs2.customers_id and
cs.companies_id = cs2.companies_id and
cs.subaccount < cs2.subaccount
) cc join
customers c
on c.id = cc.customers_id;
In particular, this can take advantage of an index on customers_subaccounts(customers_id, companies_id, subaccount).
Note: This assumes that the subaccounts are different for the rows you want. What is really needed is a way of defining unique rows in the customers_subaccounts table.
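As a sketch of that last point: if (customers_id, companies_id, subaccount) really is meant to be unique and no duplicate rows exist yet, a unique key would both define row uniqueness and double as the composite index mentioned above (the key name is made up):
-- enforces one row per customer/company/subaccount and can serve the self-join above
ALTER TABLE customers_subaccounts
  ADD UNIQUE KEY uq_cust_comp_sub (customers_id, companies_id, subaccount);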
There is a way to speed up the query by caching the subquery result. A simple change to your query makes MySQL aware that it can cache (materialize) the subquery result:
SELECT * FROM customers
WHERE id IN
(select * from
(SELECT distinct customers_id
FROM customers_subaccounts
GROUP BY customers_id, companies_id
HAVING COUNT(subaccount) > 1) t1);
I used this trick many years ago and it helped me a lot.
Try the following ;)
SELECT DISTINCT t1.*
FROM customers t1
INNER JOIN customers_subaccounts t2 ON t1.id = t2.customers_id
GROUP BY t1.id, t1.name, t2.companies_id
HAVING COUNT(t2.subaccount) > 1
You may also add an index on customers_id.

Why is this query really slow with 70k+ rows?

First of all, this is my table structure:
CREATE TABLE IF NOT EXISTS `site_forum_comments` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`forum_id` int(11) NOT NULL,
`user_id` int(11) NOT NULL,
`data` int(11) NOT NULL,
`comment` longtext NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;
Before importing my backup it had something like 10-15 rows, and I made a ranking system based on the number of comments; this query was working flawlessly:
SELECT u.id, u.username, COUNT(f.id) AS rank
FROM site_users AS u
LEFT JOIN site_forum_comments AS f ON (f.user_id = u.id)
GROUP BY u.id
ORDER BY rank DESC
LIMIT :l
But now, with more than 70k rows inserted, the script won't even load and just crashes the server.
What have I possibly done wrong? Is the problem with the query itself, or with the table structure?
Thanks in advance, cheers!
This is your query:
SELECT u.id, u.username, COUNT(f.id) AS rank
FROM site_users u LEFT JOIN
site_forum_comments f
ON f.user_id = u.id
GROUP BY u.id
ORDER BY rank DESC
LIMIT :l
Because you are choosing the highest-ranked users, you can probably use an inner join rather than an outer join. In any case, this version doesn't have a great many optimization opportunities, but you do need an index on site_forum_comments(user_id, id).
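For reference, the inner-join variant mentioned above might look like this (a sketch; it drops users with zero comments, a literal 10 stands in for the :l bind parameter, and rank is backquoted because it is a reserved word in MySQL 8.0):
SELECT u.id, u.username, COUNT(*) AS `rank`
FROM site_users u
INNER JOIN site_forum_comments f ON f.user_id = u.id  -- can use the (user_id, id) index suggested above
GROUP BY u.id, u.username
ORDER BY `rank` DESC
LIMIT 10;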
You might get better performance with the same index and a correlated subquery:
SELECT u.id, u.username,
(SELECT COUNT(*)
FROM site_forum_comments f
WHERE f.user_id = u.id
) as rank
FROM site_users u
ORDER BY rank DESC
LIMIT :l;
You are currently joining all users to their comments without an index on the user_id column; that's slow.
The following query selects the user with the highest rank first and only joins that single row with the site_users table (using the index on site_users.id), so it should be faster.
SELECT site_users.id, site_users.username, a.rank
FROM (
SELECT user_id, COUNT(*) as rank
FROM site_forum_comments
GROUP BY user_id
ORDER BY rank DESC
LIMIT 1
) AS a
LEFT JOIN site_users ON a.user_id = site_users.id
Note that with this query you won't get a result if the rank is 0.

GROUP BY with MAX date field - erratic results

I have a table containing form data. Each row contains a section_id and a field_id. There are 50 distinct fields for each section. As users update an existing field, a new row is inserted with an updated date_modified. This keeps a rolling archive of changes.
The problem is that I'm getting erratic results when pulling the most recent set of fields to display on a page.
I've narrowed down the problem to a couple of fields, and have recreated a portion of the table in question on SQLFiddle.
Schema:
CREATE TABLE IF NOT EXISTS `cTable` (
`section_id` int(5) NOT NULL,
`field_id` int(5) DEFAULT NULL,
`content` text,
`user_id` int(11) NOT NULL,
`date_modified` datetime NOT NULL,
KEY `section_id` (`section_id`),
KEY `field_id` (`field_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
This query shows all previously edited rows for field_id 39. There are five rows returned:
SELECT cT.*
FROM cTable cT
WHERE
cT.section_id = 123 AND
cT.field_id=39;
Here's what I'm trying to do to pull the most recent row for field_id 39. No rows returned:
SELECT cT.*
FROM cTable cT
INNER JOIN (
SELECT field_id, MAX(date_modified) AS date_modified
FROM cTable GROUP BY field_id
) AS max USING (field_id, date_modified)
WHERE
cT.section_id = 123 AND
cT.field_id=39;
Record Count: 0;
If I try the same query on a different field_id, say 54, I get the correct result:
SELECT cT.*
FROM cTable cT
INNER JOIN (
SELECT field_id, MAX(date_modified) AS date_modified
FROM cTable GROUP BY field_id
) AS max USING (field_id, date_modified)
WHERE
cT.section_id = 123 AND
cT.field_id=54;
Record Count: 1;
Why would same query work on one field_id, but not the other?
In the subquery from which you are getting the maxima, you need to GROUP BY section_id, field_id. Using just GROUP BY field_id skips the section_id, on which you are applying a filter:
SELECT cT.*
FROM cTable cT
INNER JOIN (
SELECT section_id,field_id, MAX(date_modified) AS date_modified
FROM cTable GROUP BY section_id,field_id
) AS max
ON(max.field_id =cT.field_id
AND max.date_modified=cT.date_modified
AND max.section_id=cT.section_id
)
WHERE
cT.section_id = 123 AND
cT.field_id=39;
See Fiddle Demo
You are looking for the max(date_modified) per field_id, but you should look for the max(date_modified) per field_id where the section_id is 123. Otherwise you may find a date for which there is no matching row later.
SELECT cT.*
FROM cTable cT
INNER JOIN (
SELECT field_id, MAX(date_modified) AS date_modified
FROM cTable
WHERE section_id = 123
GROUP BY field_id
) AS max USING (field_id, date_modified)
WHERE
cT.section_id = 123 AND
cT.field_id=39;
Here is the SQL fiddle: http://www.sqlfiddle.com/#!2/0cefd8/19.

Update MySQL table based on GROUP_CONCAT

UPDATE BELOW!
Who can help me out?
I have a table:
CREATE TABLE `group_c` (
`parent_id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`child_id` int(11) DEFAULT NULL,
`number` int(11) DEFAULT NULL,
PRIMARY KEY (`parent_id`)
) ENGINE=InnoDB;
INSERT INTO group_c(parent_id,child_id)
VALUES (1,1),(2,2),(3,3),(4,1),(5,4),(6,4),(7,6),(8,1),(9,2),(10,1),(11,1),(12,1),(13,0);
I want to update the number field to 1 for each child that has multiple parents:
SELECT group_concat(parent_id), count(*) as c FROM group_c group by child_id having c>1
Result:
GROUP_CONCAT(PARENT_ID) C
12,11,10,8,1,4 6
9,2 2
6,5 2
So all rows with parent_id 12,11,10,8,1,4,9,2,6,5 should be updated to number =1
I've tried something like:
UPDATE group_c SET number=1 WHERE FIND_IN_SET(parent_id, SELECT pid FROM (select group_concat(parent_id), count(*) as c FROM group_c group by child_id having c>1));
but that is not working.
How can I do this?
SQLFIDDLE: http://sqlfiddle.com/#!2/acb75/5
[edit]
I tried to make the example simple but the real thing is a bit more complicated since I'm grouping by multiple fields. Here is a new fiddle: http://sqlfiddle.com/#!2/7aed0/11
Why use GROUP_CONCAT() and then try to do something with its result via FIND_IN_SET()? That's not how SQL is intended to work. You can use a simple JOIN to retrieve your records:
SELECT
parent_id
FROM
group_c
INNER JOIN
(SELECT
child_id,
count(*) as c
FROM
group_c
group by
child_id
having c>1) AS childs
ON childs.child_id=group_c.child_id
Check your modified demo. If you want an UPDATE, then just use:
UPDATE
group_c
INNER JOIN
(SELECT
child_id,
count(*) as c
FROM
group_c
group by
child_id
having c>1) AS childs
ON childs.child_id=group_c.child_id
SET
group_c.number=1
For anyone interested, this is how I solved it. It takes two queries, but in my case that's not really an issue.
UPDATE group_c INNER JOIN (
SELECT parent_id, count( * ) AS c
FROM `group_c`
GROUP BY child1,child2
HAVING c >1
) AS cc ON cc.parent_id = group_c.parent_id
SET group_c.number =1 WHERE number =0;
UPDATE group_c INNER JOIN group_c as gc ON
(gc.child1=group_c.child1 AND gc.child2=group_c.child2 AND gc.number=1)
SET group_c.number=1;
fiddle: http://sqlfiddle.com/#!2/46d0b4/1/0
Here's a similar solution...
UPDATE group_c a
JOIN
( SELECT DISTINCT x.child_id candidate
FROM group_c x
JOIN group_c y
ON y.child_id = x.child_id
AND y.parent_id < x.parent_id
) b
ON b.candidate = a.child_id
SET number = 1;
http://sqlfiddle.com/#!2/bc532/1