SQL join with where and having count() condition - mysql

I have 2 tables
Sleep_sessions [id, user_id, (some other values)]
Tones [id, sleep_session_id (FK to sleep_sessions.id), (some other values)]
I need to select 10 sleep_sessions where user_id = 55 and where each sleep_session record has at least 2 tone records associated with it.
I currently have the following:
SELECT `sleep_sessions`.*
FROM (`sleep_sessions`)
JOIN `tones` ON sleep_sessions.id = `tones`.`sleep_session_id`
WHERE `user_id` = 55
GROUP BY `sleep_sessions`.`id`
HAVING count(tones.id) > 4
ORDER BY `started` desc
LIMIT 10
However, I've noticed that count(tones.id) appears to count the entire tones table rather than only the tones belonging to the sleep_session being joined.
Many thanks for your help,
Andy

I'm not sure what went wrong with your query. Maybe try
HAVING count(*)
The following query might be a bit more readable (having can be a bit of a pain to understand):
SELECT *
FROM (`sleep_sessions`)
WHERE `user_id` = 55
AND (SELECT count(*) FROM `tones`
WHERE `sleep_sessions`.`id` = `tones`.`sleep_session_id`) > 4
ORDER BY `started` desc
LIMIT 10
The advantage of this is that you avoid the ill-defined semantics your original query creates between its GROUP BY and ORDER BY clauses: it selects and sorts by columns that are neither grouped nor aggregated. MySQL is one of the few databases that would accept your original query. Here's some insight:
http://dev.mysql.com/doc/refman/5.6/en/group-by-hidden-columns.html
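To see that the HAVING form and the correlated-subquery form select the same sessions, here is a minimal sketch. It assumes SQLite (through Python's sqlite3 module) as a stand-in for MySQL; the sample rows, the `started` values and the threshold of 2 are made up for illustration.

```python
import sqlite3

# SQLite stands in for MySQL; table and column names follow the question,
# the rows and the MIN_TONES threshold are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sleep_sessions (id INTEGER PRIMARY KEY, user_id INTEGER, started TEXT);
    CREATE TABLE tones (id INTEGER PRIMARY KEY, sleep_session_id INTEGER);
    INSERT INTO sleep_sessions VALUES
        (1, 55, '2013-01-01'), (2, 55, '2013-01-02'),
        (3, 55, '2013-01-03'), (4, 99, '2013-01-04');
    INSERT INTO tones (sleep_session_id) VALUES (1), (1), (1), (2), (3), (3), (4), (4);
""")

MIN_TONES = 2

# Variant 1: join, group per session, filter the groups with HAVING.
join_sql = """
    SELECT s.id, COUNT(t.id) AS tone_count
    FROM sleep_sessions AS s
    JOIN tones AS t ON t.sleep_session_id = s.id
    WHERE s.user_id = 55
    GROUP BY s.id
    HAVING COUNT(t.id) >= ?
    ORDER BY s.started DESC
    LIMIT 10
"""

# Variant 2: no join, count the tones of each session in a correlated subquery.
subquery_sql = """
    SELECT s.id
    FROM sleep_sessions AS s
    WHERE s.user_id = 55
      AND (SELECT COUNT(*) FROM tones AS t WHERE t.sleep_session_id = s.id) >= ?
    ORDER BY s.started DESC
    LIMIT 10
"""

join_rows = conn.execute(join_sql, (MIN_TONES,)).fetchall()
subquery_ids = [row[0] for row in conn.execute(subquery_sql, (MIN_TONES,))]

print(join_rows)      # (session id, tone count) pairs, newest session first
print(subquery_ids)   # the same session ids, newest first
```

In both variants the count is taken per session, not over the whole tones table, so a count that looks like the full table usually points to a missing or wrong join condition.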

Related

MySQL : Group By Clause Not Using Index when used with Case

I'm using MySQL.
I can't change the DB structure, so that's not an option, sadly.
THE ISSUE:
When I use GROUP BY with CASE (as needed in my situation), MySQL uses
filesort and the delay is humongous (approx. 2-3 minutes):
http://sqlfiddle.com/#!9/f97d8/11/0
But when I don't use CASE and just GROUP BY group_id, MySQL easily uses
the index and the result is fast:
http://sqlfiddle.com/#!9/f97d8/12/0
Scenario: DETAILED
Table msgs, containing records of sent messages, with fields:
id,
user_id, (the guy who sent the message)
type, (0 means it's a group msg. All msgs sent to a group are marked by group_id. So let's say 5 msgs were sent for group_id = 5: the table will have 5 records with group_id = 5 and type = 0. For type > 0, group_id will be NULL, because all other types are individual msgs sent to a single recipient and have no group_id)
group_id (if type=0, will contain group_id, else NULL)
Table contains approx 10 million records for user id 50001 and with different types (i.e group as well as individual msgs)
Now the QUERY:
SELECT
msgs.*
FROM
msgs
INNER JOIN accounts
ON (
msgs.user_id = accounts.id
)
WHERE 1
AND msgs.user_id IN (50111)
AND msgs.type IN (0, 1, 5, 7)
GROUP BY CASE `msgs`.`type` WHEN 0 THEN `msgs`.`group_id` ELSE `msgs`.`id` END
ORDER BY `msgs`.`group_id` DESC
LIMIT 100
I HAVE to get the summary in a single QUERY,
so msgs sent to a group, let's say group 5 (which has 5 records in this table), will be shown as 1 record in the summary (I may show the COUNT later, but that's not an issue).
The individual msgs have NULL as group_id, so I can't just use 'GROUP BY group_id', because that would group all individual msgs into a single record, which is not acceptable.
Sample output can be something like:
id  owner_id  type  group_id  COUNT
1   50001     0     2         5
1   50001     1     NULL      1
1   50001     4     NULL      1
1   50001     0     7         5
1   50001     5     NULL      1
1   50001     5     NULL      1
1   50001     5     NULL      1
1   50001     0     10        5
Now the problem is that the GROUP BY with CASE (which I currently think I need, because I only want to group by group_id if type = 0) causes a lot of delay, because it doesn't use the indexes, which it does if I don't use CASE (i.e. just GROUP BY group_id). Please view the SQL Fiddles above to see the EXPLAIN results.
Can anyone please advise how to optimize this?
UPDATE
I tried a workaround that somewhat works (it drops the INITIAL queries to 1 sec). It uses a UNION: instead of letting MySQL write one huge result set to disk for the filesort, it limits the result set of the group msgs and of the individual msgs separately (see the query below).
-- The first part of the union retrieves group msgs (type 0, grouped by group_id) and applies a LIMIT to cap the out-of-control result set.
-- The second part retrieves individual msgs (type != 0, grouped by msgs.id; not strictly necessary, but just to be safe from duplicate entries due to joins) and applies a LIMIT to cap the result set.
-- The outer query combines the two to retrieve the desired result set.
Here's the query:
SELECT
*
FROM
(
(
SELECT
msgs.id as reference_id, user_id, type, group_id
FROM
msgs
INNER JOIN accounts
ON (msgs.user_id = accounts.id)
WHERE 1
AND accounts.id IN (50111 ) AND type = 0
GROUP BY msgs.group_id
ORDER BY msgs.id DESC
LIMIT 40
)
UNION
ALL
(
SELECT
msgs.id as reference_id, user_id, type, group_id
FROM
msgs
INNER JOIN accounts
ON (
msgs.user_id = accounts.id
)
WHERE 1
AND msgs.type != 0
AND accounts.id IN (50111)
GROUP BY msgs.id
ORDER BY msgs.id
LIMIT 40
)
) AS temp
ORDER BY reference_id
LIMIT 20,20
But it has a lot of caveats:
- I need to handle the limit in the inner queries as well. Let's say 20 recs per page, and I'm on page 4. For the inner queries I need to apply LIMIT 0,80, since I'm not sure how many records each of the two parts contributed to the previous 3 pages. So, as the records per page and the number of pages grow, my query grows heavier. Let's say 1k recs per page, and I'm on page 100, or 1k: the load gets heavier and the time increases sharply.
- I need to handle ordering in the inner queries and then apply it again to the result set prepared by the union; conditions need to be applied to both inner queries separately (but that's not much of an issue).
- I can't use SQL_CALC_FOUND_ROWS, so I'll need to get the count using separate queries.
The main issue is the first one. The higher I go with the pagination, the heavier it gets.
Would this run faster?
SELECT id, user_id, type, group_id
FROM
( SELECT id, user_id, type, group_id, IFNULL(group_id, id) AS foo
FROM msgs
WHERE user_id IN (50111)
AND type IN (0, 1, 5, 7)
) AS x -- MySQL requires an alias for every derived table
GROUP BY foo
ORDER BY `group_id` DESC
LIMIT 100
It needs INDEX(user_id, type).
Does this give the 'correct' answer?
SELECT DISTINCT *
FROM msgs
WHERE user_id IN (50111)
AND type IN (0, 1, 5, 7)
GROUP BY IFNULL(group_id, id)
ORDER BY `group_id` DESC
LIMIT 100
(It needs the same index)
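A minimal sketch of the IFNULL(group_id, id) grouping, assuming SQLite (through Python's sqlite3) as a stand-in for MySQL. The sample rows and the index name `ix_msgs_user_type` are made up; the message ids start at 101 so that they cannot collide with the group ids 2 and 7 (a group_id equal to some individual message's id would land in the same bucket).

```python
import sqlite3

# SQLite stands in for MySQL; rows and index name are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE msgs (id INTEGER PRIMARY KEY, user_id INTEGER, type INTEGER, group_id INTEGER);
    CREATE INDEX ix_msgs_user_type ON msgs (user_id, type);
    INSERT INTO msgs (id, user_id, type, group_id) VALUES
        (101, 50111, 0, 2), (102, 50111, 0, 2), (103, 50111, 0, 2),
        (104, 50111, 1, NULL), (105, 50111, 5, NULL),
        (106, 50111, 0, 7), (107, 50111, 0, 7),
        (108, 50111, 3, NULL),   -- type 3 is filtered out by the WHERE clause
        (109, 60000, 1, NULL);   -- belongs to another user
""")

# Group msgs share their group_id as bucket; individual msgs fall back to their own id.
rows = conn.execute("""
    SELECT IFNULL(group_id, id) AS bucket,
           MIN(type)            AS msg_type,
           MAX(group_id)        AS group_id,
           COUNT(*)             AS msg_count
    FROM msgs
    WHERE user_id IN (50111)
      AND type IN (0, 1, 5, 7)
    GROUP BY IFNULL(group_id, id)
    ORDER BY bucket
""").fetchall()

for row in rows:
    print(row)
```

Each group collapses into one row with its message count, while every individual message stays a row of its own.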

Need help optimizing 4 heavy queries on one webpage

I have four queries that run on one web page. I use them for statistics and they are taking too long to load.
Here are my current configurations
use the text wrapping button on pastebin to make it easier to read.
I have a lot of RAM dedicated to MySQL but it still takes a long time. I have also indexed most of the columns.
I'm just trying to see what other options I have.
I put "show create table" and total count(*) in here. I'm going to rename everything and paste in SO. I agree that someone in the future may use it.
QUERY ONE
SELECT SQL_NO_CACHE
DATE_FORMAT(DateActioned,'%M-%Y') as val1,
COUNT(*) AS total_count
FROM
db.statisticsresults
WHERE
DID = 28
AND ActionTypeID = 1
AND DateActioned IS NOT NULL
GROUP BY
DATE_FORMAT(DateActioned, '%m-%y')
ORDER BY
YEAR( DateActioned ) DESC,
MONTH( DateActioned ) DESC
For this one, I would have a covering index based on your key elements so the engine does not have to go back to the raw data. Based on this and your following queries, I would put the DID column in the first position of the index, such as
StatisticsResults -- index ( DID, ActionTypeID, DateActioned )
The ORDER BY on the respective YEAR() descending and MONTH() descending will do the same thing as your hard-coded references to FIND the field in the list.
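As a rough illustration of the covering-index idea, the sketch below assumes SQLite (through Python's sqlite3) instead of MySQL, uses strftime in place of DATE_FORMAT, and invents the sample rows, the `notes` column and the index name `ix_did_action_date`. The query only touches the three indexed columns, so the engine can answer it from the index alone.

```python
import sqlite3

# SQLite stands in for MySQL; rows, the notes column and the index name are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE statisticsresults (
        id INTEGER PRIMARY KEY,
        DID INTEGER,
        ActionTypeID INTEGER,
        DateActioned TEXT,
        notes TEXT            -- hypothetical extra column the query never reads
    );
    CREATE INDEX ix_did_action_date
        ON statisticsresults (DID, ActionTypeID, DateActioned);
    INSERT INTO statisticsresults (DID, ActionTypeID, DateActioned, notes) VALUES
        (28, 1, '2013-01-05', 'a'), (28, 1, '2013-01-20', 'b'),
        (28, 1, '2012-12-31', 'c'), (28, 9, '2013-01-07', 'd'),
        (28, 1, NULL, 'e'),         (30, 1, '2013-01-09', 'f');
""")

query = """
    SELECT strftime('%Y-%m', DateActioned) AS month, COUNT(*) AS total_count
    FROM statisticsresults
    WHERE DID = 28
      AND ActionTypeID = 1
      AND DateActioned IS NOT NULL
    GROUP BY month
    ORDER BY month DESC
"""

rows = conn.execute(query).fetchall()

# The last column of each EXPLAIN QUERY PLAN row describes the access path.
plan_text = " ".join(r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + query))

print(rows)
print(plan_text)
```

In MySQL the equivalent check would be to run EXPLAIN on the query and look at which key it reports; the exact plan output differs between engines.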
QUERY TWO
-- 381.812
SELECT SQL_NO_CACHE
DATE_FORMAT(DateActioned,'%M-%Y') as val1,
COUNT(*) AS total_count
FROM
db.statisticsdivision
WHERE
DID = 28
AND ActionTypeID = 9
AND DateActioned IS NOT NULL
GROUP BY
DATE_FORMAT(DateActioned, '%m-%y')
ORDER BY
YEAR( DateActioned ) DESC,
MONTH( DateActioned ) DESC
On this one, I changed DID = '28' to DID = 28. If the column is numeric, don't confuse the engine by making it convert one type to the other. The same index from query 1 would apply here too.
QUERY THREE
-- 33.899
SELECT SQL_NO_CACHE DISTINCT
AID,
COUNT(*) AS acount
FROM
db.statisticsresults
JOIN db.division_id USING(AID)
WHERE
DID = 28
GROUP BY
AID
ORDER BY
count(*) DESC
LIMIT
19
This one looks like a bit of a waste... you are joining to the division table based on an "AID" column in the stats table. Why are you doing the join unless you actually are expecting some invalid "AID" values not in the division table? Again, change your "DID" column to 28 instead of '28'. Ensure your division table has its index on "AID" for the join. The SECOND index from query 1 appears to be your better option
QUERY FOUR
-- 21.403
SELECT SQL_NO_CACHE DISTINCT
TID,
tax,
agent,
COUNT(*) AS t_count
FROM
db.statisticsresults sr
JOIN db.tax_id USING(TID)
JOIN db.agent_id ai ON(ai.AID = sr.AID)
WHERE
DID = 28
GROUP BY
TID,
sr.AID
ORDER BY
COUNT(*) DESC
LIMIT 19
Again, change the "DID" comparison from '28' to 28.
For your TAX_ID table, have a covering index on that too, so it can handle the join
to the agent table without going to the raw page data
Tax_ID -- index ( tid, aid )
Finally, if you are dealing with your original list finding things only from Jan 2012 to Dec 2013, you can simplify querying the ENTIRE table of stats by adding to your WHERE clause...
AND DateActioned >= '2012-01-01'
So you completely skip over anything prior to 2012 (old data I presume?)

How can I write a query that aggregate a single row with latest date among multiple set of rows?

I have a MySQL table where there are many rows for each person, and I want to write a query which aggregates rows with a special constraint (one row per person).
For example, let's say the table consists of the following data.
name  date                 reason
---------------------------------------
John  2013-04-01 14:00:00  Vacation
John  2013-03-31 18:00:00  Sick
Ted   2012-05-06 20:00:00  Sick
Ted   2012-02-20 01:00:00  Vacation
John  2011-12-21 00:00:00  Sick
Bob   2011-04-02 20:00:00  Sick
I want to see the distribution of 'reason' column. If I just write a query like below
select reason, count(*) as count from table group by reason
then I will be able to see number of reasons for this table overall.
reason count
------------------
Sick 4
Vacation 2
However, I am only interested in a single reason for each person. The reason that is counted should come from the row with the latest date among that person's records. For example, John's latest reason would be Vacation while Ted's latest reason would be Sick. And Bob's latest reason (and his only reason) is Sick.
The expected result for that query should be like below. (Sum of count will be 3 because there are only 3 people)
reason count
-----------------
Sick 2
Vacation 1
Is it possible to write a query such that single latest reason will be counted when I want to see distribution(count) of reasons?
Here are some facts about the table.
The table has tens of millions of rows
Most of the time, each person has one reason.
Some people have multiple reasons, but 99.99% of people have fewer than 5 reasons.
There are about 30 different reasons while there are millions of distinct names.
The table is partitioned based on date range.
SELECT T.REASON, COUNT(*)
FROM
(
SELECT NAME, MAX(DATE) AS MAX_DATE
FROM TABLE-NAME
GROUP BY NAME
) A, TABLE-NAME T
WHERE T.NAME = A.NAME AND T.DATE = A.MAX_DATE
GROUP BY T.REASON
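Try this (the name has to be matched together with the date, otherwise a row whose date happens to equal another person's latest date would be counted too):
select reason, count(*) from
(select reason from table where (name, date) in
(select name, max(date) from table group by name)) t
group by reason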
In MySQL (before version 8.0), this kind of query is not very efficient, since you don't have access to window functions such as RANK() OVER (PARTITION BY ...) as you do in SQL Server or Oracle.
You can still emulate it with a subquery that retrieves the rows matching the condition you need, here the maximum date:
SELECT t.reason, COUNT(1)
FROM
(
SELECT name, MAX(adate) AS maxDate
FROM aTable
GROUP BY name
) maxDateRows
INNER JOIN aTable t ON maxDateRows.name = t.name
AND maxDateRows.maxDate = t.adate
GROUP BY t.reason
You can see a sample here.
Test this query on your samples, but I'm afraid that it will be slow as hell.
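The join-on-maximum-date query can be tried against the sample data from the question. This is a minimal sketch that assumes SQLite (through Python's sqlite3) as a stand-in for MySQL; the table name `absences` and the index name are made up.

```python
import sqlite3

# SQLite stands in for MySQL; table and index names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE absences (name TEXT, adate TEXT, reason TEXT);
    CREATE INDEX ix_absences_name_adate ON absences (name, adate);
    INSERT INTO absences VALUES
        ('John', '2013-04-01 14:00:00', 'Vacation'),
        ('John', '2013-03-31 18:00:00', 'Sick'),
        ('Ted',  '2012-05-06 20:00:00', 'Sick'),
        ('Ted',  '2012-02-20 01:00:00', 'Vacation'),
        ('John', '2011-12-21 00:00:00', 'Sick'),
        ('Bob',  '2011-04-02 20:00:00', 'Sick');
""")

# Step 1 (derived table): latest date per person.
# Step 2 (join): fetch the row carrying that date, then count per reason.
rows = conn.execute("""
    SELECT t.reason, COUNT(*) AS cnt
    FROM (SELECT name, MAX(adate) AS max_date
          FROM absences
          GROUP BY name) AS latest
    JOIN absences AS t
      ON t.name = latest.name AND t.adate = latest.max_date
    GROUP BY t.reason
    ORDER BY t.reason
""").fetchall()

print(rows)  # [('Sick', 2), ('Vacation', 1)]
```

An index on (name, adate) is the kind of index that typically helps both the MAX() per name and the join back to the table.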
For your information, you can do the same thing in a more elegant and much much faster way in SQL Server :
SELECT reason, COUNT(1)
FROM
(
SELECT name
, reason
, RANK() OVER(PARTITION BY name ORDER BY adate DESC) as Rank
FROM #aTable
) AS rankTable
WHERE Rank = 1
GROUP BY reason
The sample is here
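The ranking approach is not tied to SQL Server; any engine with window functions can run it, including MySQL from version 8.0. Here is a minimal sketch assuming SQLite 3.25 or later through Python's sqlite3, with a made-up table name `absences` and the sample data from the question.

```python
import sqlite3

# Requires an SQLite build with window functions (3.25+); names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE absences (name TEXT, adate TEXT, reason TEXT);
    INSERT INTO absences VALUES
        ('John', '2013-04-01 14:00:00', 'Vacation'),
        ('John', '2013-03-31 18:00:00', 'Sick'),
        ('Ted',  '2012-05-06 20:00:00', 'Sick'),
        ('Ted',  '2012-02-20 01:00:00', 'Vacation'),
        ('John', '2011-12-21 00:00:00', 'Sick'),
        ('Bob',  '2011-04-02 20:00:00', 'Sick');
""")

# Rank each person's rows by date (newest first) and keep only rank 1.
ranked_rows = conn.execute("""
    SELECT reason, COUNT(*) AS cnt
    FROM (
        SELECT name,
               reason,
               RANK() OVER (PARTITION BY name ORDER BY adate DESC) AS rnk
        FROM absences
    ) AS ranked
    WHERE rnk = 1
    GROUP BY reason
    ORDER BY reason
""").fetchall()

print(ranked_rows)  # [('Sick', 2), ('Vacation', 1)]
```

With RANK(), two rows sharing a person's latest date would both be counted; ROW_NUMBER() would keep exactly one of them.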
If you are really stuck with MySQL, and the first query is too slow, then you can split the problem.
Do a first query creating a table:
CREATE TABLE maxDateRows AS
SELECT name, MAX(adate) AS maxDate
FROM aTable
GROUP BY name
Then create an index on both name and maxDate.
Finally, get the results:
SELECT t.reason, COUNT(1)
FROM maxDateRows m
INNER JOIN aTable t ON m.name = t.name
AND m.maxDate = t.adate
GROUP BY t.reason
The problem you describe seems to be solved by this query:
select
reason,
count(*)
from (select * from tablename group by name) abc
group by
reason
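It is quite fast and simple, but note that it relies on MySQL picking an arbitrary row per name for the non-aggregated columns, so it is not guaranteed to use the row with the latest date. You can view the SQL Fiddle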
Apologies if this answer duplicates an existing one. Maybe I'm suffering from some form of aphasia, but I cannot see it...
SELECT x.reason
, COUNT(*)
FROM absentism x
JOIN
( SELECT name,MAX(date) max_date FROM absentism GROUP BY name) y
ON y.name = x.name
AND y.max_date = x.date
GROUP
BY reason;

Group BY and ORDER BY optimization

I'm trying to optimize a query on a table of 180 000 rows.
SELECT
qid
FROM feed_pool
WHERE (
(feed_pool.uid IN
(SELECT uid_followed
FROM followers
WHERE uid_follower = 123 AND unfollowed = 0) AND feed_pool.uid != 123)
OR feed_pool.tid IN (SELECT tid FROM follow_tags WHERE follow_tags.uid = 123)
)
GROUP BY feed_pool.qid ORDER BY feed_pool.id DESC LIMIT 20
The worst part of this query is not the WHERE clause, it is the GROUP BY and ORDER BY part.
Actually, if I do just the GROUP BY, it's fine. Just the ORDER BY is also fine. The problem is when I use both.
I have tried different indexes, and I'm now using an index on feed_pool.qid and feed_pool.uid.
A good hack is to first SELECT the last 20 rows (ORDER BY), and then do the GROUP BY. But obviously that's not exactly what I want to do: in some cases I don't end up with 20 rows.
I really don't know what to do. I can change my structure if it optimizes my query (20 sec...). Really, every tip would be appreciated.
Thanks in advance.
try this
GROUP BY feed_pool.qid ORDER BY 1 DESC LIMIT 20
Have you heard of JOIN? Subqueries are often bad for performance.
Try something like this:
SELECT feed_pool.qid, followers.uid_followed AS fuid, follow_tags.uid AS ftuid
FROM feed_pool
LEFT JOIN followers
ON feed_pool.uid = followers.uid_followed
AND followers.uid_follower = 123
AND followers.unfollowed = 0
AND feed_pool.uid != 123
LEFT JOIN follow_tags
ON feed_pool.tid = follow_tags.tid
AND follow_tags.uid = 123
WHERE
followers.uid_followed IS NOT NULL
OR follow_tags.uid IS NOT NULL
ORDER BY feed_pool.id DESC
LIMIT 20

Why is my SQL so slow?

My table is reasonably small, around 50,000 rows. My schema is as follows:
DAILY
match_id
user_id
result
round
tournament_id
Query:
SELECT user_id
FROM `daily`
WHERE user_id IN (SELECT user_id
FROM daily
WHERE round > 25
AND tournament_id = 24
AND (result = 'Won' OR result = 'Lost'))
Using the IN keyword the way you are is very dangerous from a performance perspective. It can result in the subquery (select user_id from daily where round > 25 and tournament_id=24 and (result='Won' or result='Lost')) being run 50,000 times in this case.
You'll want to convert this into a join, something to the effect of
select user_id from daily a join
(select user_id from daily where round > 25 and tournament_id=24 and (result='Won' or result='Lost')) b on a.user_id = b.user_id
Doing something similar to this will result in only two queries and a join.
As Cybernate pointed out in your specific example you can simply use where clauses, but I went ahead and suggested this in case your query is actually more complex than what you posted.
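One detail worth checking when converting IN to a join: if a user has several qualifying rows, a plain join repeats that user's rows once per match. The sketch below assumes SQLite (through Python's sqlite3) as a stand-in, with made-up sample rows, and adds DISTINCT to the derived table so the join returns the same rows as the IN form.

```python
import sqlite3

# SQLite stands in for the asker's database; the sample rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily (match_id INTEGER PRIMARY KEY, user_id INTEGER,
                        result TEXT, round INTEGER, tournament_id INTEGER);
    INSERT INTO daily VALUES
        (1, 10, 'Won',     26, 24),
        (2, 10, 'Lost',    27, 24),
        (3, 10, 'Won',      3, 24),
        (4, 20, 'Won',     30, 23),   -- other tournament
        (5, 30, 'Pending', 26, 24),   -- neither won nor lost
        (6, 40, 'Lost',    26, 24);
""")

qualifying = """
    SELECT {distinct} user_id FROM daily
    WHERE round > 25 AND tournament_id = 24 AND result IN ('Won', 'Lost')
"""

# Original form: IN with a subquery.
in_ids = [r[0] for r in conn.execute(
    "SELECT user_id FROM daily WHERE user_id IN ("
    + qualifying.format(distinct="") + ") ORDER BY match_id")]

# Join against the de-duplicated list of qualifying users.
join_ids = [r[0] for r in conn.execute(
    "SELECT a.user_id FROM daily a JOIN ("
    + qualifying.format(distinct="DISTINCT")
    + ") b ON a.user_id = b.user_id ORDER BY a.match_id")]

# Same join without DISTINCT: user 10 qualifies twice, so each of its rows appears twice.
naive_ids = [r[0] for r in conn.execute(
    "SELECT a.user_id FROM daily a JOIN ("
    + qualifying.format(distinct="")
    + ") b ON a.user_id = b.user_id ORDER BY a.match_id")]

print(in_ids, join_ids, len(naive_ids))
```

Whether the extra DISTINCT matters depends on the data; when each user can qualify at most once, the plain join already matches the IN form.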
First, verify and add indexes as suggested earlier.
Also, why are you using IN if you are querying data from the same table?
Change your query to:
SELECT user_id
FROM daily
WHERE round > 25
AND tournament_id = 24
AND ( result = 'Won'
OR result = 'Lost' )
Your query only needs to be:
SELECT d.user_id
FROM DAILY d
WHERE d.round > 25
AND d.tournament_id = 24
AND d.result IN ('Won', 'Lost')
Indexes should be considered on:
DAILY.round
DAILY.tournament_id
DAILY.result
This should return in a millisecond.
SELECT user_id FROM daily WITH(NOLOCK)
where user_id in (select user_id from daily WITH(NOLOCK) where round > 25 and tournament_id = 24 and (result = 'Won' or result = 'Lost'))
Then make sure there is an index on the filter columns.
CREATE NONCLUSTERED INDEX IX_1 ON daily (round ASC, tournament_id ASC, result ASC)