cumulative sum for each user and sorting - mysql

I'm trying to get the cumulative sum for each user.
related tables(just example):
[user]
id
nickname
A
AA
B
BB
[pointTable] user_id -> [user]id
id
user_id
point
piA
A
10
piB
B
8
[pointHistoryTable] point_id -> [point]id
id
point_id
gain
use
phi1
piA
25
0
phi2
piB
10
0
phi3
piA
0
10
phi4
piB
0
9
phi5
piB
7
0
(For gain-use column, only one of them has a value.)
The result I want:
nickname
current
cGainSum
cUseSum
AA
10
25
10
BB
8
17
9
The query I used(mysql v5.7):
#1
SELECT
user.nickname AS nickname,
pointTable.point AS current,
sub.cGainSum AS cGainSum,
sub.cUseSum AS cUseSum
FROM
(SELECT
point_id, SUM(gain) AS cGainSum, SUM(`use`) AS cUseSum
FROM
pointHistoryTable
GROUP BY point_id) sub
INNER JOIN
pointTable ON pointTable.id = sub.point_id
INNER JOIN
user ON user.id = pointTable.user_id
ORDER BY cGainSum DESC
LIMIT 20 OFFSET 0;
#2
SELECT
user.nickname AS nickname,
pointTable.id AS pointId,
pointTable.point AS current,
(SELECT
IFNULL(SUM(gain), 0)
FROM
pointHistoryTable
WHERE
point_id = pointId AND gain > 0) AS cGainSum,
(SELECT
IFNULL(SUM(`use`), 0)
FROM
pointHistoryTable
WHERE
point_id = pointId AND `use` > 0) AS cUseSum
FROM
pointTable
INNER JOIN
user ON user.id = pointTable.user_id
ORDER BY cGainSum DESC
LIMIT 20 OFFSET 0;
Both work. But sorting takes a long time. (20,000 users)
When sorting with current, #1 takes about 25s and #2 takes about 300ms.
However, when sorting by cumulative sum(cGainSum or cUseSum), #1 takes about 25s again and #2 takes about 50s.
So #1 always causes a slow query, and #2 causes a slow query when sorting by cumulative sum.
Any other suggestions?
++
I'm using this query in node api. The data is sorted by the request query. The request query can be current, cGainSum, or cUseSum.
like this...
SELECT (...) ORDER BY ${query} DESC LIMIT 20 OFFSET 0;
The offset uses the pagination related request query.
(included in the details)

I imagine that without all those sub queries it would run faster with just simple joins and aggregate function if I misunderstood something you are welcome to correct me, this is anyways what I came up with
SELECT user.nickname,pointTable.point,SUM(pointHistoryTable.gain), SUM(pointHistoryTable.use) FROM user
LEFT JOIN pointTable
ON user.id=pointTable.user_id
LEFT JOIN pointHistoryTable
ON pointHistoryTable.point_id=pointTable.id
GROUP BY user.id
ORDER BY ${query} DESC LIMIT 20 OFFSET 0;
EDIT:
Besides improving the query, like not creating to complicated subqueries etc. A very easy way to improve performance is the use of indexes. I would for starters create indexes for all the id's in all the tables. You simple do this by using CREATE INDEX indexname ON tablename (column1,column2, etc) in your case for pointHistory the query would look something like this
CREATE INDEX pointHistoryIndex ON pointHistoryTable ('id','point_id')

Related

MySQL : Group By Clause Not Using Index when used with Case

Im using MySQL
I cant change the DB structure, so thats not an option sadly
THE ISSUE:
When i use GROUP BY with CASE (as need in my situation), MYSQL uses
file_sort and the delay is humongous (approx 2-3minutes):
http://sqlfiddle.com/#!9/f97d8/11/0
But when i dont use CASE just GROUP BY group_id , MYSQL easily uses
index and result is fast:
http://sqlfiddle.com/#!9/f97d8/12/0
Scenerio: DETAILED
Table msgs, containing records of sent messages, with fields:
id,
user_id, (the guy who sent the message)
type, (0=> means it's group msg. All the msgs sent under this are marked by group_id. So lets say group_id = 5 sent 5 msgs, the table will have 5 records with group_id =5 and type=0. For type>0, the group_id will be NULL, coz all other types have no group_id as they are individual msgs sent to single recipient)
group_id (if type=0, will contain group_id, else NULL)
Table contains approx 10 million records for user id 50001 and with different types (i.e group as well as individual msgs)
Now the QUERY:
SELECT
msgs.*
FROM
msgs
INNER JOIN accounts
ON (
msgs.user_id = accounts.id
)
WHERE 1
AND msgs.user_id IN (50111)
AND msgs.type IN (0, 1, 5, 7)
GROUP BY CASE `msgs`.`type` WHEN 0 THEN `msgs`.`group_id` ELSE `msgs`.`id` END
ORDER BY `msgs`.`group_id` DESC
LIMIT 100
I HAVE to get summary in a single QUERY,
so msgs sent to group lets say 5 (have 5 records in this table) will be shown as 1 record for summary (i may show COUNT later, but thats not an issue).
The individual msgs have NULL as group_id, so i cant just put 'GROUP BY group_id ' coz that will Group all individual msgs to single record which is not acceptable.
Sample output can be something like:
id owner_id, type group_id COUNT
1 50001 0 2 5
1 50001 1 NULL 1
1 50001 4 NULL 1
1 50001 0 7 5
1 50001 5 NULL 1
1 50001 5 NULL 1
1 50001 5 NULL 1
1 50001 0 10 5
Now the problem is that the GROUP condition after using CASE (which i currently think that i have to because i only need to group by group_id if type=0) is causing alot of delay coz it's not using indexes which it does if i dont use CASE (like just group by group_id ). Please view SQLFiddles above to see the explain results
Can anyone plz give an advice how to get it optimized
UPDATE
I tried a workaround , that does somehow works out (drops INITIAL queries to 1sec). Using union, what it does is, to minimize the resultset by union that forces SQL to write on disk for filesort (due to huge resultset), limit the resultset of group msgs, and individual msgs (view query below)
-- first part of union retrieves group msgs (that have type 0 and needs to be grouped by group_id). Applies the limit to captivate the out of control result set
-- The second query retrieves individual msgs, (those with type !=0, grouped by msgs.id - not necessary but just to be save from duplicate entries due to joins). Applies the limit to captivate the out of control result set
-- JOins the two to retrieve the desired resultset
Here's the query:
SELECT
*
FROM
(
(
SELECT
msgs.id as reference_id, user_id, type, group_id
FROM
msgs
INNER JOIN accounts
ON (msgs.user_id = accounts.id)
WHERE 1
AND accounts.id IN (50111 ) AND type = 0
GROUP BY msgs.group_id
ORDER BY msgs.id DESC
LIMIT 40
)
UNION
ALL
(
SELECT
msgs.id as reference_id, user_id, type, group_id
FROM
msgs
INNER JOIN accounts
ON (
msgs.user_id = accounts.id
)
WHERE 1
AND msgs.type != 0
AND accounts.id IN (50111)
GROUP BY msgs.id
ORDER BY msgs.id
LIMIT 40
)
) AS temp
ORDER BY reference_id
LIMIT 20,20
But has alot of caveats,
-I need to handle the limit in inner queries as well. Lets say 20recs per page, and im on page 4. For inner queries , i need to apply limit 0,80, since im not sure which of the two parts had how many records in the previous 3 pages. So, as the records per page and number of pages grow, my query grows heavier. Lets say 1k rec per page, and im on page 100 , or 1K, the load gets heavier and time exponentially increases
I need to handle ordering in inner queries and then apply on the resultset prepared by union , conditions need to be applied on both inner queries seperately(but not much of an issue)
-Cant use calc_found_rows, so will need to get count using queries seperately
The main issue is the first one. The higher i go with the pagination , the heavier it gets
Would this run faster?
SELECT id, user_id, type, group_id
FROM
( SELECT id, user_id, type, group_id, IFNULL(group_id, id) AS foo
FROM msgs
WHERE user_id IN (50111)
AND type IN (0, 1, 5, 7)
)
GROUP BY foo
ORDER BY `group_id` DESC
LIMIT 100
It needs INDEX(user_id, type).
Does this give the 'correct' answer?
SELECT DISTINCT *
FROM msgs
WHERE user_id IN (50111)
AND type IN (0, 1, 5, 7)
GROUP BY IFNULL(group_id, id)
ORDER BY `group_id` DESC
LIMIT 100
(It needs the same index)

MySQL get two random rows (complex), performance

I need to get 2 random rows but not just with rand() because it's very bad for performance by 10k+ rows so I got this code from another question here:
SELECT b.*
FROM bilder b CROSS JOIN
(SELECT COUNT(*) as cnt FROM bilder) v
WHERE rand() <= 5 / cnt
ORDER BY rand()
LIMIT 2
So I get 2 random rows from the table bilder and the performance is much better now. But I need to specify it a bit more. I need the rows only where the field geschlecht got the value female so i tryed:
SELECT b.*
FROM bilder b CROSS JOIN
(SELECT COUNT(*) as cnt FROM bilder) v
WHERE rand() <= 5 / cnt AND geschlecht = 'female'
ORDER BY rand()
LIMIT 2
But now I sometimes get only one row and sometimes none. How can I do this right?
Suppose you have 100 rows in bilder but only 10 of those rows have geschlecht='female'.
The first query tests its random selection 100 times, and each time it has a 5/100 chance of selecting the row. The odds of picking no rows is therefore 0.95100 (for an explanation why see the Birthday Problem), in other words only 0.5% chance of picking no rows.
The second query tests its random selection only 10 times, and each time it still has 5/100 chance of selecting the row. The odds of picking no rows is 0.9510 which is 59.87% chance!
It would be better if you apply the condition to the count subquery as well:
SELECT b.*
FROM bilder b CROSS JOIN
(SELECT COUNT(*) as cnt FROM bilder WHERE geschlect = 'female') v
WHERE rand() <= 5 / cnt AND geschlecht = 'female'
ORDER BY rand()
LIMIT 2
Now cnt is only 10, and the random chance is therefore 5/10 chance of selecting the row. So the odds of picking no rows is 0.5010, or less than 0.1%.

Mysql: Order by max N values from subquery

I'm about to throw in the towel with this.
Preface: I want to make this work with any N, but for the sake of simplicity, I'll set N to be 3.
I've got a query (MySQL, specifically) that needs to pull in data from a table and sort based on top 3 values from that table and after that fallback to other sort criteria.
So basically I've got something like this:
SELECT tbl.id
FROM
tbl1 AS maintable
LEFT JOIN
tbl2 AS othertable
ON
maintable.id = othertable.id
ORDER BY
othertable.timestamp DESC,
maintable.timestamp DESC
Which is all basic textbook stuff. But the issue is I need the first ORDER BY clause to only get the three biggest values in othertable.timestamp and then fallback on maintable.timestamp.
Also, doing a LIMIT 3 subquery to othertable and join it is a no go as this needs to work with an arbitrary number of WHERE conditions applied to maintable.
I was almost able to make it work with a user variable based approach like this, but it fails since it doesn't take into account ordering, so it'll take the FIRST three othertable values it finds:
ORDER BY
(
IF(othertable.timestamp IS NULL, 0,
IF(
(#rank:=#rank+1) > 3, null, othertable.timestamp
)
)
) DESC
(with a #rank:=0 preceding the statement)
So... any tips on this? I'm losing my mind with the problem. Another parameter I have for this is that since I'm only altering an existing (vastly complicated) query, I can't do a wrapping outer query. Also, as noted, I'm on MySQL so any solutions using the ROW_NUMBER function are unfortunately out of reach.
Thanks to all in advance.
EDIT. Here's some sample data with timestamps dumbed down to simpler integers to illustrate what I need:
maintable
id timestamp
1 100
2 200
3 300
4 400
5 500
6 600
othertable
id timestamp
4 250
5 350
3 550
1 700
=>
1
3
5
6
4
2
And if for whatever reason we add WHERE NOT maintable.id = 5 to the query, here's what we should get:
1
3
4
6
2
...because now 4 is among the top 3 values in othertable referring to this set.
So as you see, the row with id 4 from othertable is not included in the ordering as it's the fourth in descending order of timestamp values, thus it falls back into getting ordered by the basic timestamp.
The real world need for this is this: I've got content in "maintable" and "othertable" is basically a marker for featured content with a timestamp of "featured date". I've got a view where I'm supposed to float the last 3 featured items to the top and the rest of the list just be a reverse chronologic list.
Maybe something like this.
SELECT
id
FROM
(SELECT
tbl.id,
CASE WHEN othertable.timestamp IS NULL THEN
0
ELSE
#i := #i + 1
END AS num,
othertable.timestamp as othertimestamp,
maintable.timestamp as maintimestamp
FROM
tbl1 AS maintable
CROSS JOIN (select #i := 0) i
LEFT JOIN tbl2 AS othertable
ON maintable.id = othertable.id
ORDER BY
othertable.timestamp DESC) t
ORDER BY
CASE WHEN num > 0 AND num <= 3 THEN
othertimestamp
ELSE
maintimestamp
END DESC
Modified answer:
select ilv.* from
(select sq.*, #i:=#i+1 rn from
(select #i := 0) i
CROSS JOIN
(select m.*, o.id o_id, o.timestamp o_t
from maintable m
left join othertable o
on m.id = o.id
where 1=1
order by o.timestamp desc) sq
) ilv
order by case when o_t is not null and rn <=3 then rn else 4 end,
timestamp desc
SQLFiddle here.
Amend where 1=1 condition inside subquery sq to match required complex selection conditions, and add appropriate limit criteria after the final order by for paging requirements.
Can you use a union query as below?
(SELECT id,timestamp,1 AS isFeatured FROM tbl2 ORDER BY timestamp DESC LIMIT 3)
UNION ALL
(SELECT id,timestamp,2 AS isFeatured FROM tbl1 WHERE NOT id in (SELECT id from tbl2 ORDER BY timestamp DESC LIMIT 3))
ORDER BY isFeatured,timestamp DESC
This might be somewhat redundant, but it is semantically closer to the question you are asking. This would also allow you to parameterize the number of featured results you want to return.

SQL join with with where and having count() condition

I have 2 tables
Sleep_sessions [id, user_id, (some other values)]
Tones [id, sleep_sessions.id (FK), (some other values)]
I need to select 10 sleep_sessions where user_id = 55 and where each sleep_session record has at least 2 tone records associated with it.
I currently have the following;
SELECT `sleep_sessions`.*
FROM (`sleep_sessions`)
JOIN `tones` ON sleep_sessions.id = `tones`.`sleep_session_id`
WHERE `user_id` = 55
GROUP BY `sleep_sessions`.`id`
HAVING count(tones.id) > 4
ORDER BY `started` desc
LIMIT 10
However I've noticed that count(tone.id) is basically the entire of the tones table and not the current sleep_session being joined
Many thanks for your help,
Andy
I'm not sure what went wrong with your query. Maybe, try
HAVING count(*)
The following query might be a bit more readable (having can be a bit of a pain to understand):
SELECT *
FROM (`sleep_sessions`)
WHERE `user_id` = 55
AND (SELECT count(*) FROM `tones`
WHERE `sleep_sessions`.`id` = `tones`.`sleep_session_id`) > 4
ORDER BY `started` desc
LIMIT 10
The advantage of this is the fact that you won't mess up the wrong semantics you have created between your GROUP BY and ORDER BY clauses. Only MySQL would ever accept your original query. Here's some insight:
http://dev.mysql.com/doc/refman/5.6/en/group-by-hidden-columns.html

Why is my SQL so slow?

My table is reasonably small around 50,000 rows. My schema is as follows:
DAILY
match_id
user_id
result
round
tournament_id
Query:
SELECT user_id
FROM `daily`
WHERE user_id IN (SELECT user_id
FROM daily
WHERE round > 25
AND tournament_id = 24
AND (result = 'Won' OR result = 'Lost'))
Using the in keyword in the fashion you are is a very dangerous [from a performance perspective] thing to do. It will result in the sub query [(select user_id from daily where round > 25 and tournament_id=24 and (result='Won' or result='Lost'))] being ran 50,000 times in this case.
You'll want to convert this onto a join something to the effect of
select user_id from daily a join
(select user_id from daily where round > 25 and tournament_id=24 and (result='Won' or result='Lost')) b on a.user_id = b.user_id
Doing something similar to this will result in only two queries and a join.
As Cybernate pointed out in your specific example you can simply use where clauses, but I went ahead and suggested this in case your query is actually more complex than what you posted.
First verify and add Indexes as suggested earlier.
Also why are you using an in if you are querying data from same table.
Change your query to:
SELECT user_id
FROM daily
WHERE round > 25
AND tournament_id = 24
AND ( result = 'Won'
OR result = 'Lost' )
Your query only needs to be:
SELECT d.user_id
FROM DAILY d
WHERE d.round > 25
AND d.tournament_id = 24
AND d.result IN ('Won', 'Lost')
Indexes should be considered on:
DAILY.round
DAILY.tournament_id
DAILY.result
This should return in a millisecond.
SELECT user_id FROM daily WITH(NOLOCK)
where user_id in (select user_id from daily WITH(NOLOCK) where round > 25 and tournament_id = 24 and (result = 'Won' or result = 'Lost'))
Then make sure there is an index on the filter columns.
CREATE NONCLUSTERED INDEX IX_1 ON daily (round ASC, tournament_id ASC, result ASC)