I am having trouble understanding how to solve a seemingly simple problem of sorting results.
I want to compare how many other users like the same fruits as the user with ID 1, count who has the most matches, and display the results in descending order.
users:
1 jack
2 john
3 jim
fruits:
id, title
1 apple
2 banana
3 orange
4 pear
5 mango
relations: 2 indexes (user_id, fruit_id) and (fruit_id, user_id)
user_id, fruit_id
1 1
1 2
1 5
2 1
2 2
2 4
3 3
3 1
expected results: (comparing with Jack's favourite fruits (user_id=1))
user_id, count
1 3
2 2
3 1
Query:
SELECT user_id, COUNT(*) AS count FROM relations
WHERE fruit_id IN (SELECT fruit_id FROM relations WHERE user_id=1)
GROUP BY user_id
HAVING count>=2
More "optimized" query:
SELECT user_id, COUNT(*) AS count FROM relations r
WHERE EXISTS (SELECT 1 FROM relations WHERE user_id=1 and r.fruit_id=fruit_id)
GROUP BY user_id
HAVING count>=2
2 is the minimum number of matches (required for the future).
explain:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY r index NULL uid 8 NULL 15 Using where; Using index
2 DEPENDENT SUBQUERY relations eq_ref xox,uid xox 8 r.relations,const 1 Using where; Using index
Everything works fine until I try to use ORDER BY count DESC.
Then I see: Using temporary; Using filesort
I don't want to use temporary tables or a filesort, because the database will be under high load in the future.
I know this is how SQL is defined and how it operates, but I cannot figure out how to do it another way, without temporary tables and a filesort.
I need to show the users who have the most matches first.
Please, help me out.
UPDATE:
I did some tests with the query from Walker Farrow (which still uses a filesort).
20,000 rows - avg 0.05 seconds
120,000 rows - avg 0.20 seconds
1,100,000 rows - avg 2.9 seconds
Disappointing results.
It would be possible to change the table structure, but with such counting and sorting I don't know how.
Are there any suggestions on how this can be done?
Probably the best way to do this would be to create a subquery and then ORDER BY in the outer query, something like this:
select *
from (
SELECT user_id, COUNT(*) AS count FROM relations r
WHERE EXISTS (SELECT 1 FROM relations WHERE user_id=1 and r.fruit_id=fruit_id)
GROUP BY user_id
HAVING count(*)>=2
) x
order by count desc
Also, I don't know why you need to add exists. Can you just say the following:
select *
from (
SELECT user_id, COUNT(*) AS count FROM relations r
WHERE user_id=1
GROUP BY user_id
HAVING count(*)>=2
) x
order by count desc
?
I am not sure, maybe I am missing something. Hope that helps!
Related
I have a MESSAGE table with 1M rows (and growing). Every query for messages involves selecting rows WHERE isRequest = True or WHERE isRequest = False, but never both. The vast majority of my queries are looking for isRequest = False. This table is written to extremely frequently and I need to maintain fast writes (as users love to send messages to each other with low latency). Also note that the MESSAGE table currently has no column indexes other than the primary key.
95% of the rows have isRequest = False and only 5% of rows have isRequest = True. Is it more performant to index the isRequest boolean field in such a scenario?
In addition, I understand that indexing columns consumes memory but is this overhead equivalent for all column data types including, in my case, boolean values?
Update:
After further analysis with @Rick James we have come up with a new table schema (note all PKs are auto-inc so time relativity is discernible):
MESSAGE (id=PK) (sender_id, recipient_id, conversation_id = FKs)
---------------------------------------------------------------
id sender_id recipient_id message conversation_id
1 1 2 "hows it going" 4
2 2 1 "great! hbu" 4
3 1 8 "hey man" 3
4 9 1 "please respond" 2
5 4 6 "goodnight girl" 1
CONVERSATION (id=PK) (userA_id, userB_id = FKs)
-----------------------------------------------
id userA_id userB_id
1 4 6
2 1 9
3 1 8
4 1 2
USERCONVERSATION (id=PK) (userA/B_id, conver_id, lastMsg_id = FKs)
------------------------------------------------------------------
id userA_id userB_id conver_id lastMsg_id isRequest
1 4 6 1 5 False
2 6 4 1 5 False
3 1 9 2 4 True
4 9 1 2 4 True
5 1 8 3 3 False
6 8 1 3 3 False
7 1 2 4 2 False
8 2 1 4 2 False
Indexes:
MESSAGE: index(id),
index(conversation_id, id)
CONVERSATION: index(id),
USERCONVERSATION: index(id),
index(user_id, isRequest),
index(user_id, lastMessage_id),
index(conversation_id)
Queries in application:
The following queries should be performant due to proper indexing as stated above. Please reach out if improvements can be made.
To get latest 20 conversations (including the last message content and the other user's information) for a variable userID:
SELECT T3.userB_id, T3.username, T3.profilePic, T3.conversation_id,
       T4.message
FROM
(
    SELECT T1.userB_id, T2.username, T2.profilePic, T1.conversation_id,
           T1.lastMessage_id
    FROM
    (
        SELECT userB_id, conversation_id, lastMessage_id
        FROM rage.userconversation
        WHERE userA_id = {userID}
        AND isRequest=False
    ) AS T1
    LEFT JOIN rage.user AS T2 ON T1.userB_id = T2.id
) AS T3
LEFT JOIN rage.message AS T4 ON T3.lastMessage_id = T4.id
ORDER BY T4.id DESC
LIMIT 20
Word explanation: Get 20 of the most recent USERCONVERSATION rows, as the lastMessage is stored there. To find the 20 most recent for a given user, select all the rows with user_id = userID and sort by lastMessage_id DESC. This is accurate because message_id is auto-incrementing. Along with the last message, we need some user data (profile picture, username) for the other user in the conversation. We achieve this by left joining.
Result:
RESULT (for userID = 1)
---------------------------------------------------------------
userB_id username profilePic message conver_id
8 John 8.jpg "hey man" 3
2 Daisy 2.jpg "great! hbu" 4
Then when the user taps on a conversation, since we have the conversation_id, we simply:
SELECT * FROM rage.message WHERE conversation_id={conver_id} ORDER BY id DESC LIMIT 20
Hopefully since we indexed (conversation_id, id) the sorting is fast.
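One way to sanity-check that (a sketch; the literal 4 stands in for a real conversation_id):
EXPLAIN
SELECT * FROM rage.message
WHERE conversation_id = 4
ORDER BY id DESC
LIMIT 20;
-- With index(conversation_id, id) this should show a ref lookup on that
-- index and no "Using filesort" in the Extra column.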
You have multiple options here. From what you describe, one of the following two seems appropriate:
A clustered index where the first key is IsRequest.
A partitioning scheme that includes IsRequest.
Another possibility is two separate tables.
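A rough sketch of the partitioning option, assuming isRequest is stored as a TINYINT (MySQL requires the partition column to be part of every unique key, including the primary key):
ALTER TABLE message
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (id, isRequest);

ALTER TABLE message
    PARTITION BY LIST (isRequest) (
        PARTITION p_requests VALUES IN (1),  -- the 5% of rows with isRequest = True
        PARTITION p_normal   VALUES IN (0)   -- the 95% with isRequest = False
    );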
However, because I doubt that your queries are returning 95% of the rows -- or even 5% -- there are undoubtedly other filters. It may be more important to create indexes for those filters rather than for the boolean flag.
Use a composite index. Let's see the entire WHERE clause to give you accurate details.
Example
WHERE IsRequest = True
AND UserId = 12345
would benefit from
INDEX(IsRequest, UserId)
(and it does not matter which order you put the column names in, nor does it matter whether it is True or False.)
Your Example
OR wrecks the use of indexes
UNION between two queries might avoid OR.
No index is useful for the query as you wrote it.
There will be two nested table scans.
Maybe
(I don't know if the following does the same thing.)
( SELECT m1.id, m1.sender_id, m1.recipient_id, m1.message ...
    FROM myapp_message AS m1
    LEFT JOIN myapp_message AS m2
           ON m1.sender_id = m2.sender_id
          AND m1.id < m2.id
    WHERE m2.id IS NULL
      AND m1.sender_id = {userID}
      AND m1.isRequest = False
    ORDER BY m1.id DESC
    LIMIT 20
) UNION ALL (
  SELECT m1.id, m1.sender_id, m1.recipient_id, m1.message ...
    FROM myapp_message AS m1
    LEFT JOIN myapp_message AS m2
           ON m1.recipient_id = m2.recipient_id
          AND m1.id < m2.id
    WHERE m2.id IS NULL
      AND m1.recipient_id = {userID}
      AND m1.isRequest = False
    ORDER BY m1.id DESC
    LIMIT 20
) ORDER BY id DESC LIMIT 20
If you will be paginating, see this: http://mysql.rjweb.org/doc.php/pagination#pagination_and_union
Closer
SELECT m...
FROM
( SELECT xid, MAX(mid) AS mid
FROM
(
( SELECT recipient_id AS xid,
MAX(mid) AS mid -- The last message TO each recipient
FROM messages WHERE sender_id = 1234 -- FROM the user in question
GROUP BY recipient_id
ORDER BY 2 DESC -- ("2nd column")
LIMIT 20
)
UNION ALL
( SELECT sender_id AS xid,
MAX(mid) AS mid -- The last message FROM each sender
FROM messages WHERE recipient_id = 1234 -- TO the user
GROUP BY sender_id
ORDER BY 2 DESC
LIMIT 20
)
) AS y
GROUP BY xid -- yes, repeated
ORDER BY mid DESC -- yes, repeated
LIMIT 20 -- yes, repeated
) AS x
JOIN messages AS m ON m.mid = x.mid
With both of these indexes:
INDEX(sender_id, recipient_id, mid)
INDEX(recipient_id, sender_id, mid)
One INDEX is for each subquery. Each is optimal, plus "covering".
(I don't see the relevance of isRequest, so I left it out. I suspect that if the column is needed it can be added to the indexes without loss of efficiency -- if put in a proper position.)
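In plain DDL those two indexes would be roughly (table and index names assumed):
ALTER TABLE messages
    ADD INDEX idx_sender_recipient_mid (sender_id, recipient_id, mid),
    ADD INDEX idx_recipient_sender_mid (recipient_id, sender_id, mid);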
For this query, and perhaps others, it would be good to have another column in the table. It would be a unique number, say "conversation_id", that is derived from unique pairs of sender and recipient.
A crude way (but not necessarily the optimal way) is to derive it somehow from distinct values of this ordered pair:
(LEAST(sender_id, recipient_id), GREATEST(recipient_id, sender_id))
Then INDEX(conversation_id, id) would probably be the key to the query being discussed. At that point, we can add back in the discussion of the boolean. I would suspect that this would ultimately be the optimal index:
INDEX(conversation_id, isRequest, id)
(or possibly with the first two columns swapped).
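A hypothetical way to backfill such a column, assuming a conversations lookup table keyed on the (LEAST, GREATEST) pair (that table is an assumption, not part of the schema above):
ALTER TABLE messages ADD COLUMN conversation_id INT;

UPDATE messages AS m
JOIN conversations AS c
    ON c.userA_id = LEAST(m.sender_id, m.recipient_id)
   AND c.userB_id = GREATEST(m.sender_id, m.recipient_id)
SET m.conversation_id = c.id;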
I have a table with 1v1 matches like this:
match_number|winner_id|loser_id
------------+---------+--------
1 | 1 | 2
2 | 2 | 3
3 | 1 | 2
4 | 1 | 4
5 | 4 | 1
and I would like to get something like this:
player|matches_won|matches_lost
------+-----------+------------
1 | 3 | 1
2 | 1 | 2
3 | 0 | 1
4 | 1 | 1
My MySQL query looks like this:
SELECT win_matches."winner_id" player, COUNT(win_matches."winner_id") matches_won, COUNT(lost_matches."loser_id") matches_lost FROM `matches` win_matches
JOIN `matches` lost_matches ON win_matches."winner_id" = lost_matches."winner_id"
I don't know what I did wrong, but the query just loads forever and doesn't return anything.
You want to unpivot and then aggregate:
select player_id, sum(is_win) as matches_won, sum(is_loss) as matches_lost
from ((select winner_id as player_id, 1 as is_win, 0 as is_loss
       from t
      ) union all
      (select loser_id, 0, 1
       from t
      )
     ) wl
group by player_id;
Your query is simply not correct. The two counts will produce the same value -- COUNT(<expression>) returns the number of non-NULL rows for that expression, and both of your counts are evaluated over the same joined rows.
The reason it takes forever is the Cartesian product problem. If a player has 10 wins and 10 losses, then your query produces 100 rows for that player -- and this gets worse for players who have played more often. Processing all those additional rows takes time.
If you have a separate players table, then correlated subqueries may be the fastest method:
select p.*,
(select count(*) from t where t.winner_id = p.player_id) as num_wins,
(select count(*) from t where t.loser_id = p.player_id) as num_loses
from players p;
However, this requires two indexes for performance on (winner_id) and (loser_id). Note these are separate indexes, not a single compound index.
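For example (assuming the table from the question is named matches):
CREATE INDEX idx_matches_winner ON matches (winner_id);
CREATE INDEX idx_matches_loser  ON matches (loser_id);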
You are joining the same table twice. Both aliases, win_matches and lost_matches, refer to the table matches, which is what causes your query to run away.
You probably don't need separate tables for wins and losses, and could do both in the same table by writing a one or a zero in a column for each.
I don't want to change your model too much and make it difficult to understand, so here is a slight modification of what it could look like:
SELECT m."player_id" player,
SUM(m."win") matches_won,
SUM(m."loss") matches_lost
FROM `matches` m
GROUP BY player_id
Without a join, all in the same table with win and loss columns. It looked to me like you wanted to know the number of win and loss per player, which you can do with a group by player and a sum/count.
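For reference, a minimal sketch of the single-table layout this assumes (names are hypothetical; one row per player per match):
CREATE TABLE matches (
    match_number INT     NOT NULL,
    player_id    INT     NOT NULL,
    win          TINYINT NOT NULL,  -- 1 if this player won the match, else 0
    loss         TINYINT NOT NULL,  -- 1 if this player lost the match, else 0
    PRIMARY KEY (match_number, player_id)
);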
I'm trying to count occurrences of a name, but I want each row returned even if that name has already been counted. The data looks like:
ID | NAME
1 Peter
2 Simone
3 Otto
4 Cedric
5 Peter
6 Cedric
7 Cedric
The following only returns one row per unique name
select id, first_name, count(first_name) from table group by first_name
ID | FIRST_NAME | count(first_name)
1 Peter 2
2 Simone 1
3 Otto 1
4 Cedric 3
But I'm trying to return every row, something like:
ID | FIRST_NAME | count(first_name)
1 Peter 2
2 Simone 1
3 Otto 1
4 Cedric 3
5 Peter 2
6 Cedric 3
7 Cedric 3
If you are using MySQL version >= 8.0, then you can use window functions:
select id,
first_name,
count(*) over (partition by first_name)
from table
For earlier versions:
select id,
first_name,
(select count(*) from table where first_name = t.first_name)
from table t
You can use a correlated subquery:
SELECT t1.id,
t1.first_name,
(SELECT COUNT(id)
FROM table t2
WHERE t2.first_name = t1.first_name) AS total_count
FROM table t1
Edit: now that I've seen the other answers, why is joining better than using a correlated subquery? Because a correlated subquery is executed for every row in your table. When you join it, the query is executed just once.
Then you have to join those queries.
select * from
table
inner join (
select first_name, count(first_name) as name_count from table group by first_name
) qcounts on table.first_name = qcounts.first_name
Also note that in your query you have to remove id from the select clause, since you neither have it in your group by clause nor apply an aggregate function to it. Therefore an arbitrary value is returned for this column.
It's a good idea to let MySQL remind you of that by activating the only_full_group_by sql mode. To do this you can do
set global sql_mode = concat(@@global.sql_mode, ',only_full_group_by');
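On MySQL 8.0 you can also make the setting survive a server restart (a sketch):
set persist sql_mode = concat(@@global.sql_mode, ',only_full_group_by');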
Been working at this for a while now and cannot seem to get it optimized. Although it does work, each left-joined logs* table reads every row in the table regardless of whether it is part of the set it is joined to (user_ids). While it returns correct results as is, this will become a problem as the user base and the db as a whole grow.
Some quick background: given an account ID there can be any number of computers linked to it. On each of those computers there can be any number of users. These user_ids are then linked in the logs tables. Each of these relationships is indexed (account_id, computer_id, user_id) in the necessary tables.
I have put the left joins in subqueries to prevent a Cartesian product (a previous issue, which the subqueries solved).
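For reference, a minimal sketch of the schema described above (column types and index names are assumptions):
CREATE TABLE computers (
    computer_id   INT PRIMARY KEY,
    account_id    INT,
    status        TINYINT,
    computer_name VARCHAR(64),
    INDEX idx_account_status (account_id, status)
);

CREATE TABLE users (
    user_id     INT PRIMARY KEY,
    computer_id INT,
    username    VARCHAR(64),
    INDEX idx_computer (computer_id)
);

-- logs1 through logs6 all share the same shape:
CREATE TABLE logs1 (
    id      INT PRIMARY KEY,
    user_id INT,
    INDEX idx_user (user_id)
);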
Query:
SELECT
users.username as username,
computers.computer_name as computer_name,
l1.cnt as cnt1,
l2.cnt as cnt2,
l3.cnt as cnt3,
l4.cnt as cnt4,
l5.cnt as cnt5,
l6.cnt as cnt6
FROM computers
INNER JOIN users
on users.computer_id = computers.computer_id
LEFT JOIN
(SELECT
user_id,
count(*) as cnt
from logs1
group by user_id
) AS l1
on l1.user_id = users.user_id
LEFT JOIN
(SELECT
user_id,
count(*) as cnt
from logs2
group by user_id
) AS l2
on l2.user_id = users.user_id
LEFT JOIN
(SELECT
user_id,
count(*) as cnt
from logs3
group by user_id
) AS l3
on l3.user_id = users.user_id
LEFT JOIN
(SELECT
user_id,
count(*) as cnt
from logs4
group by user_id
) AS l4
on l4.user_id = users.user_id
LEFT JOIN
(SELECT
user_id,
count(*) as cnt
from logs5
group by user_id
) AS l5
on l5.user_id = users.user_id
LEFT JOIN
(SELECT
user_id,
count(*) as cnt
from logs6
group by user_id
) AS l6
on l6.user_id = users.user_id
WHERE computers.account_id = :cw_account_id AND computers.status = :cw_status
GROUP BY users.user_id
Plan:
computers 1 PRIMARY ref PRIMARY,unique_filter,status unique_filter 4 const 5 Using where; Using temporary; Using filesort
users 1 PRIMARY ref PRIMARY,unique_filter unique_filter 4 stephen_spcplus_inno.computers.computer_id 1 Using index
<derived2> 1 PRIMARY ref <auto_key0> <auto_key0> 4 stephen_spcplus_inno.users.user_id 3
logs1 2 DERIVED index user_id user_id 8 33 Using index
<derived3> 1 PRIMARY ref <auto_key0> <auto_key0> 4 stephen_spcplus_inno.users.user_id 10
logs2 3 DERIVED index user_id user_id 8 101 Using index
<derived4> 1 PRIMARY ref <auto_key0> <auto_key0> 4 stephen_spcplus_inno.users.user_id 4
logs3 4 DERIVED index user_id user_id 8 41 Using index
<derived5> 1 PRIMARY ref <auto_key0> <auto_key0> 4 stephen_spcplus_inno.users.user_id 2
logs4 5 DERIVED index user_id user_id 8 28 Using index
<derived6> 1 PRIMARY ref <auto_key0> <auto_key0> 4 stephen_spcplus_inno.users.user_id 2
logs5 6 DERIVED index user_id user_id 8 28 Using index
<derived7> 1 PRIMARY ref <auto_key0> <auto_key0> 4 stephen_spcplus_inno.users.user_id 275
logs6 7 DERIVED index user_id user_id 775 27516 Using index
Example results:
username computer_name cnt1 cnt2 cnt3 cnt4 cnt5 cnt6
testuser COMPUTER_1 1 2 1 (null) (null) 3
testuser2 COMPUTER_1 (null) (null) (null) (null) (null) (null)
someuser COMPUTER_2 32 83 26 15 28 1157
As an example, for logs6 the plan reads every row in the table (27,516), yet only 1,160 "should" have been joined.
I have tried lots of different things but cannot get this to run in an optimized manner. As it stands, every row from each log table is read because of the COUNT(*) within each join's subquery... remove it and only the needed rows are joined, like I want; however, I then do not know how to get the counts in the same grouped result.
Help from any gurus would be great! Yes, I know I do not have a lot of rows in the db yet, but I can see the results are correct and that the full table scans are going to be a problem.
EDIT (partial solution):
I have found a partial solution to this problem, but it requires an additional query to get a list of user_ids. By adding WHERE user_id IN (17,22,23) (where these are the user_ids which should be joined) to each log-table subquery, I get the correct results and the entire table is not scanned.
If anyone knows of a way to make this work without this additional query, please let me know.
I simplified your question to 2 log-tables and played around with it a bit on SQLFiddle.
=> http://sqlfiddle.com/#!2/a99e4a/2
It seems that using a sub-query makes things worse on my example data, but I wonder how it behaves when there are many more records in the tables that don't fit the criteria.
I'd suggest you give it a try and see what comes out. I don't have a MySQL db to play around with here, and I'd rather not bring SQLFiddle to its knees =)
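One variation worth trying in such a test: push the outer query's filter into each derived table, so the log tables are only aggregated for the relevant users. A sketch for logs1 (the same shape applies to logs2 through logs6, reusing the outer query's parameters):
SELECT l.user_id, COUNT(*) AS cnt
FROM logs1 AS l
JOIN users     AS u ON u.user_id     = l.user_id
JOIN computers AS c ON c.computer_id = u.computer_id
WHERE c.account_id = :cw_account_id
  AND c.status     = :cw_status
GROUP BY l.user_id;
Each LEFT JOIN subquery in the original query would be replaced with this shape, which is essentially what the WHERE user_id IN (...) workaround achieves without the extra round-trip.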
For simplicity, I will give a quick example of what I am trying to achieve:
Table 1 - Members
ID | Name
--------------------
1 | John
2 | Mike
3 | Sam
Table 2 - Member_Selections
ID | planID
--------------------
1 | 1
1 | 2
1 | 1
2 | 2
2 | 3
3 | 2
3 | 1
Table 3 - Selection_Details
planID | Cost
--------------------
1 | 5
2 | 10
3 | 12
When I run my query, I want to return the sum of all member selections, grouped by member. The issue I face, however (see the table 2 data), is that some members may have duplicate information in the system by mistake. While we do our best to filter this data up front, sometimes it slips through the cracks, so when I make the necessary calls to the system to pull information, I also want to filter this data.
The results SHOULD show:
Results Table
ID | Name | Total_Cost
-----------------------------
1 | John | 15
2 | Mike | 22
3 | Sam | 15
but instead they show John at $20 because he has plan ID #1 inserted twice by mistake.
My query is currently:
SELECT
sq.ID, sq.name, SUM(sq.premium) AS total_cost
FROM
(
SELECT
m.id, m.name, g.premium
FROM members m
INNER JOIN member_selections s USING(ID)
INNER JOIN selection_details g USING(planid)
) sq group by sq.agent
Adding DISTINCT s.planID filters the results incorrectly as it will only show a single PlanID 1 sold (even though members 1 and 3 bought it).
Any help is appreciated.
EDIT
There is also another table I forgot to mention, which is the agent table (the agent who sold the plans to members).
The final group by statement groups ALL items sold by the agent ID (which turns the final results into a single row).
Perhaps the simplest solution is to put a unique composite key on the member_selections table:
alter table member_selections add unique key ms_key (ID, planID);
which would prevent any records from being added where the unique combo of ID/planID already exists elsewhere in the table. That'd allow only a single (1,1).
comment followup:
just saw your comment about the 'alter ignore...'. That'd work fine, but you'd still be left with the bad duplicates in the table. I'd suggest adding the unique key, then manually cleaning up the table. The query I put in the comments should find all the duplicates for you, which you can then weed out by hand. Once the table's clean, there'll be no need for the duplicate-handling version of the query.
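That comment isn't shown here, so this is only an assumption about its shape; a sketch of one such duplicate-finding query:
SELECT ID, planID, COUNT(*) AS dup_count
FROM member_selections
GROUP BY ID, planID
HAVING COUNT(*) > 1;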
Use UNIQUE keys to prevent accidental duplicate entries. This will eliminate the problem at the source, instead of when it starts to show symptoms. It also makes later queries easier, because you can count on having a consistent database.
What about:
SELECT
sq.ID, sq.name, SUM(sq.premium) AS total_cost
FROM
(
SELECT
m.id, m.name, g.premium
FROM members m
INNER JOIN
(select distinct ID, PlanID from member_selections) s
USING(ID)
INNER JOIN selection_details g USING(planid)
) sq group by sq.agent
By the way, is there a reason you don't have a primary key on member_selections that would prevent these duplicates from happening in the first place?
You can add a group by clause to the inner query, which groups by all three columns, basically returning only unique rows. (I also changed 'premium' to 'cost' to match your example tables, and dropped the agent part.)
SELECT
sq.ID,
sq.name,
SUM(sq.Cost) AS total_cost
FROM
(
SELECT
m.id,
m.name,
g.Cost
FROM
members m
INNER JOIN member_selections s USING(ID)
INNER JOIN selection_details g USING(planid)
GROUP BY
m.ID,
m.NAME,
g.Cost
) sq
group by
sq.ID,
sq.NAME