Should I index a Boolean field with low 'True' cardinality in MySQL?

I have a MESSAGE table with 1M rows (and growing). Every query for messages involves selecting rows WHERE isRequest = True or WHERE isRequest = False, but never both. The vast majority of my queries are looking for isRequest = False. This table is written to extremely frequently and I need to maintain fast writes (as users love to send messages to each other with low latency). Also note that the MESSAGE table currently has no column indexes other than the primary key.
95% of the rows have isRequest = False and only 5% of rows have isRequest = True. Is it more performant to index the isRequest boolean field in such a scenario?
In addition, I understand that indexing columns consumes memory but is this overhead equivalent for all column data types including, in my case, boolean values?
Update:
After further analysis with @Rick James we have come up with a new table schema (note all PKs are auto-inc so time relativity is discernible):
MESSAGE (id=PK) (sender_id, recipient_id, conversation_id = FKs)
---------------------------------------------------------------
id sender_id recipient_id message conversation_id
1 1 2 "hows it going" 4
2 2 1 "great! hbu" 4
3 1 8 "hey man" 3
4 9 1 "please respond" 2
5 4 6 "goodnight girl" 1
CONVERSATION (id=PK) (userA_id, userB_id = FKs)
-----------------------------------------------
id userA_id userB_id
1 4 6
2 1 9
3 1 8
4 1 2
USERCONVERSATION (id=PK) (userA/B_id, conver_id, lastMsg_id = FKs)
------------------------------------------------------------------
id userA_id userB_id conver_id lastMsg_id isRequest
1 4 6 1 5 False
2 6 4 1 5 False
3 1 9 2 4 True
4 9 1 2 4 True
5 1 8 3 3 False
6 8 1 3 3 False
7 1 2 4 2 False
8 2 1 4 2 False
Indexes:
MESSAGE: index(id),
index(conversation_id, id)
CONVERSATION: index(id),
USERCONVERSATION: index(id),
index(user_id, isRequest),
index(user_id, lastMessage_id),
index(conversation_id)
Queries in application:
The following queries should be performant due to proper indexing as stated above. Please reach out if improvements can be made.
To get latest 20 conversations (including the last message content and the other user's information) for a variable userID:
SELECT T3.userB_id, T3.username, T3.profilePic, T3.conversation_id AS conver_id,
T4.message
FROM
(
SELECT T1.userB_id, T2.username, T2.profilePic, T1.conversation_id,
T1.lastMessage_id
FROM
(
SELECT userB_id, conversation_id, lastMessage_id
FROM rage.userconversation
WHERE userA_id = {userID}
AND isRequest=False
) AS T1
LEFT JOIN rage.user AS T2 ON T1.userB_id = T2.id
) AS T3
LEFT JOIN rage.message AS T4 ON T3.lastMessage_id = T4.id
ORDER BY T4.id DESC
LIMIT 20
Word explanation: Get 20 of the most recent USERCONVERSATION rows as the lastMessage is stored there. In order to find the 20 most recent for a given user, select all the rows with user_id = userID and sort by lastMessage_id DESC. This is accurate because message_id is auto-incrementing. Along with the last message we need to get some user data (profile picture, username) of the other user in the conversation. We achieve this by left joining.
Result:
RESULT (for userID = 1)
---------------------------------------------------------------
userB_id username profilePic message conver_id
8 John 8.jpg "hey man" 3
2 Daisy 2.jpg "great! hbu" 4
Then when the user taps on a conversation, since we have the conversation_id, we simply:
SELECT * FROM rage.message WHERE conversation_id={conver_id} ORDER BY id DESC LIMIT 20
Hopefully since we indexed (conversation_id, id) the sorting is fast.

You have multiple options here. From what you describe, one of the following two seem appropriate:
A clustered index where the first key is IsRequest.
A partitioning scheme that includes IsRequest.
Another possibility is two separate tables.
However, because I doubt that your queries are returning 95% of the rows -- or even 5% -- there are undoubtedly other filters. It may be more important to create indexes for those filters rather than for the boolean flag.

Use a composite index. Let's see the entire WHERE clause to give you accurate details.
Example
WHERE IsRequest = True
AND UserId = 12345
would benefit from
INDEX(IsRequest, UserId)
(and it does not matter which order you put the column names in, nor does it matter whether it is True or False.)
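As a rough way to see the composite index being picked up, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL. The table layout, the body column and the index name idx_req_user are assumptions made for illustration, and SQLite's planner is not MySQL's, so treat the output only as a hint of what EXPLAIN would show on the real table.

```python
import sqlite3

# Hypothetical schema: only the two filtered columns come from the example above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE message ("
    "id INTEGER PRIMARY KEY, IsRequest INTEGER, UserId INTEGER, body TEXT)"
)
conn.execute("CREATE INDEX idx_req_user ON message (IsRequest, UserId)")

# Ask the planner how it would run the example WHERE clause.
plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT body FROM message WHERE IsRequest = 1 AND UserId = 12345"
).fetchall()

# The last column of each plan row is the human-readable detail.
plan_text = " ".join(row[3] for row in plan_rows)
print(plan_text)
```

On MySQL the equivalent check would be running EXPLAIN on the real query and looking at the key column.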
Your Example
OR wrecks the use of indexes
UNION between two queries might avoid OR.
No index is useful for the query as you wrote it.
There will be two nested table scans.
Maybe
(I don't know if the following does the same thing.)
( SELECT m1.id, m1.sender_id, m1.recipient_id, m1.message ...
FROM myapp_message AS m1
LEFT JOIN myapp_message AS m2
ON m1.sender_id = m2.sender_id
AND m1.id < m2.id
WHERE m2.id IS NULL
AND m1.sender_id = {userID}
AND m1.isRequest = False
order by id desc
LIMIT 20
) UNION ALL (
SELECT m1.id, m1.sender_id, m1.recipient_id, m1.message ...
FROM myapp_message AS m1
LEFT JOIN myapp_message AS m2
ON m1.recipient_id = m2.recipient_id
AND m1.id < m2.id
WHERE m2.id IS NULL
AND m1.recipient_id= {userID}
AND m1.isRequest = False
order by id desc
LIMIT 20
) ORDER BY id DESC LIMIT 20
If you will be paginating, see this: http://mysql.rjweb.org/doc.php/pagination#pagination_and_union
Closer
SELECT m...
FROM
( SELECT xid, MAX(mid) AS mid
FROM
(
( SELECT recipient_id AS xid,
MAX(mid) AS mid -- The last message TO each recipient
FROM messages
WHERE sender_id = 1234 -- FROM the user in question
GROUP BY recipient_id
ORDER BY 2 DESC -- ("2nd column")
LIMIT 20
)
UNION ALL
( SELECT sender_id AS xid,
MAX(mid) AS mid -- The last message FROM each sender
FROM messages
WHERE recipient_id = 1234 -- TO the user
GROUP BY sender_id
ORDER BY 2 DESC
LIMIT 20
)
) AS y
GROUP BY xid -- yes, repeated
ORDER BY mid DESC -- yes, repeated
LIMIT 20 -- yes, repeated
) AS x
JOIN messages AS m ON m.mid = x.mid
With both of these indexes:
INDEX(sender_id, recipient_id, mid)
INDEX(recipient_id, sender_id, mid)
One INDEX is for each subquery. Each is optimal, plus "covering".
(I don't see the relevance of isRequest, so I left it out. I suspect that if the column is needed it can be added to the indexes without loss of efficiency -- if put in a proper position.)

For this query, and perhaps others, it would be good to have another column in the table. It would be a unique number, say "conversation_id", that is derived from unique pairs of sender and recipient.
A crude way (but not necessarily the optimal way) is to derive it somehow from distinct values of this ordered pair:
(LEAST(sender_id, recipient_id), GREATEST(recipient_id, sender_id))
Then INDEX(conversation_id, id) would probably be the key to the query being discussed. At that point, we can add back in the discussion of the boolean. I would suspect that this would ultimately be the optimal index:
INDEX(conversation_id, isRequest, id)
(or possibly with the first two columns swapped).
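To illustrate the ordered-pair idea, here is a small sketch in Python. The function name conversation_key is hypothetical; it only shows that the LEAST/GREATEST pair is identical whichever user is the sender, which is what lets it stand in for a conversation identifier.

```python
def conversation_key(sender_id: int, recipient_id: int) -> tuple:
    """Return the same ordered pair regardless of who sent the message."""
    # Mirrors (LEAST(sender_id, recipient_id), GREATEST(sender_id, recipient_id)).
    return (min(sender_id, recipient_id), max(sender_id, recipient_id))


key_a = conversation_key(1, 2)  # user 1 writes to user 2
key_b = conversation_key(2, 1)  # user 2 replies to user 1
print(key_a, key_b)
```

In practice the pair would probably be mapped to a single integer through a lookup table such as CONVERSATION, rather than stored as two columns.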

Related

Count first occurrence with column value ordered by another column

I have an assigns table with the following columns:
id - int
id_lead - int
id_source - int
date_assigned - int (this represents a unix timestamp)
Now, let's say I have the following data in this table:
id id_lead id_source date_assigned
1 20 5 1462544612
2 20 6 1462544624
3 22 6 1462544615
4 22 5 1462544626
5 22 7 1462544632
6 25 6 1462544614
7 25 8 1462544621
Now, let's say I want to get a count of the rows whose id_source is 6 and which are the first entry for their lead (sorted by date_assigned asc).
So in this case, the count would = 2, because there are 2 leads (id_lead 22 and 25) whose first id_source is 6.
How would I write this query so that it is fast and would work fine as a subquery select? I was thinking something like this which doesn't work:
select count(*) from `assigns` where `id_source`=6 order by `date_assigned` asc limit 1
I have no idea how to write this query in an optimal way. Any help would be appreciated.
Pseudocode:
select rows
with a.id_source = 6
but only if
there do not exist any row
with same id_lead
and smaller date_assigned
Translate it to SQL
select * -- select rows
from assigns a
where a.id_source = 6 -- with a.id_source = 6
and not exists ( -- but only if there do not exist any row
select 1
from assigns a1
where a1.id_lead = a.id_lead -- with same id_lead
and a1.date_assigned < a.date_assigned -- and smaller date_assigned
)
Now replace select * with select count(*) and you'll get your result.
http://sqlfiddle.com/#!9/3dc0f5/7
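For anyone who wants to try this without the fiddle, here is a minimal sketch that loads the sample rows from the question into an in-memory SQLite database (used here as a stand-in for MySQL) and runs the NOT EXISTS query with count(*).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE assigns (id INTEGER PRIMARY KEY, id_lead INTEGER, "
    "id_source INTEGER, date_assigned INTEGER)"
)
# Sample rows copied from the question.
conn.executemany(
    "INSERT INTO assigns VALUES (?, ?, ?, ?)",
    [
        (1, 20, 5, 1462544612),
        (2, 20, 6, 1462544624),
        (3, 22, 6, 1462544615),
        (4, 22, 5, 1462544626),
        (5, 22, 7, 1462544632),
        (6, 25, 6, 1462544614),
        (7, 25, 8, 1462544621),
    ],
)

# Count rows with id_source = 6 that have no earlier row for the same lead.
first_is_six = conn.execute(
    """
    SELECT COUNT(*)
    FROM assigns a
    WHERE a.id_source = 6
      AND NOT EXISTS (
          SELECT 1
          FROM assigns a1
          WHERE a1.id_lead = a.id_lead
            AND a1.date_assigned < a.date_assigned
      )
    """
).fetchone()[0]
print(first_is_six)
```

Leads 22 and 25 both start with source 6, so the count matches the value expected in the question.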
Update:
The NOT-EXIST query can be rewritten to an excluding LEFT JOIN query:
select count(*)
from assigns a
left join assigns a1
on a1.id_lead = a.id_lead
and a1.date_assigned < a.date_assigned
where a.id_source = 6
and a1.id_lead is null
If you want to get the count for all values of id_source, the following query might be the fastest:
select a.id_source, count(1)
from (
select a1.id_lead, min(a1.date_assigned) date_assigned
from assigns a1
group by a1.id_lead
) a1
join assigns a
on a.id_lead = a1.id_lead
and a.date_assigned = a1.date_assigned
group by a.id_source
You still can replace group by a.id_source with where a.id_source = 6.
The queries need indexes on assigns(id_source) and assigns(id_lead, date_assigned).
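A sketch of the grouped variant together with the two suggested indexes, again using an in-memory SQLite database as a stand-in for MySQL. The index names are made up for the example, and on a table this small the indexes only show the DDL, not any speed-up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE assigns (id INTEGER PRIMARY KEY, id_lead INTEGER, "
    "id_source INTEGER, date_assigned INTEGER)"
)
# Hypothetical index names; the column lists are the ones suggested above.
conn.execute("CREATE INDEX idx_source ON assigns (id_source)")
conn.execute("CREATE INDEX idx_lead_date ON assigns (id_lead, date_assigned)")
conn.executemany(
    "INSERT INTO assigns VALUES (?, ?, ?, ?)",
    [
        (1, 20, 5, 1462544612),
        (2, 20, 6, 1462544624),
        (3, 22, 6, 1462544615),
        (4, 22, 5, 1462544626),
        (5, 22, 7, 1462544632),
        (6, 25, 6, 1462544614),
        (7, 25, 8, 1462544621),
    ],
)

# For each lead find its earliest assignment, then count leads per first source.
rows = conn.execute(
    """
    SELECT a.id_source, COUNT(*)
    FROM (
        SELECT id_lead, MIN(date_assigned) AS date_assigned
        FROM assigns
        GROUP BY id_lead
    ) a1
    JOIN assigns a
      ON a.id_lead = a1.id_lead
     AND a.date_assigned = a1.date_assigned
    GROUP BY a.id_source
    """
).fetchall()
first_source_counts = dict(rows)
print(first_source_counts)
```

Note that if two rows of the same lead share the minimal date_assigned, this join counts both of them.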
Simple query for that would be
check here http://sqlfiddle.com/#!9/8666e0/7
select count(*) from
(select * from assigns group by id_lead )t
where t.id_source=6

Select most recent record based on two conditions

I have user1 who exchanged messages with user2 and user4 (these parameters are known). I now want to select the latest sent or received message for each conversation (i.e. LIMIT 1 for each conversation).
SQLFiddle
Currently my query returns all messages for all conversations:
SELECT *
FROM message
WHERE (toUserID IN (2,4) AND userID = 1)
OR (userID IN (2,4) AND toUserID = 1)
ORDER BY message.time DESC
The returned rows should be messageID 3 and 6.
Assuming that higher id values indicate more recent messages, you can do this:
Find all messages that involve user 1
Group the results by the other user id
Get the maximum message id per group
SELECT *
FROM message
WHERE messageID IN (
SELECT MAX(messageID)
FROM message
WHERE userID = 1 -- optionally filter by the other user
OR toUserID = 1 -- optionally filter by the other user
GROUP BY CASE WHEN userID = 1 THEN toUserID ELSE userID END
)
ORDER BY messageID DESC
Updated SQLFiddle
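The same query can be tried locally. This is a minimal sketch with the sample rows from the question loaded into an in-memory SQLite database (a stand-in for MySQL), relying on the stated assumption that a higher messageID means a more recent message.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE message (messageID INTEGER PRIMARY KEY, userID INTEGER, "
    "toUserID INTEGER, message TEXT, time INTEGER)"
)
# Sample rows copied from the question.
conn.executemany(
    "INSERT INTO message VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, 2, "nachricht 1", 123),
        (2, 1, 2, "nachricht 2", 124),
        (3, 2, 1, "nachricht 3", 125),
        (4, 3, 2, "nachricht wrong", 1263),
        (5, 2, 4, "nachricht wrong", 1261),
        (6, 4, 1, "nachricht sandra", 126),
    ],
)

# Group user 1's messages by the other participant and keep the newest id.
latest_ids = [
    row[0]
    for row in conn.execute(
        """
        SELECT messageID
        FROM message
        WHERE messageID IN (
            SELECT MAX(messageID)
            FROM message
            WHERE userID = 1 OR toUserID = 1
            GROUP BY CASE WHEN userID = 1 THEN toUserID ELSE userID END
        )
        ORDER BY messageID DESC
        """
    )
]
print(latest_ids)
```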
You can do this easily by separating it into two queries with ORDER BY and LIMIT then joining them with UNION:
(SELECT *
FROM message
WHERE (toUserID IN (2,4) AND userID = 1)
ORDER BY message.time DESC
LIMIT 1)
UNION
(SELECT *
FROM message
WHERE (userID IN (2,4) AND toUserID = 1)
ORDER BY message.time DESC
LIMIT 1)
The parentheses are important here, and this returns messages 2 and 6, which seems correct, not 3 and 6.
It also seems like you could use UNION ALL for performance instead of UNION because there won't be duplicates between the two queries, but it's better if you decide that.
Here's your data:
MESSAGEID USERID TOUSERID MESSAGE TIME
1 1 2 nachricht 1 123
2 1 2 nachricht 2 124
3 2 1 nachricht 3 125
4 3 2 nachricht wrong 1263
5 2 4 nachricht wrong 1261
6 4 1 nachricht sandra 126
The below works as required:
SELECT m1.*
FROM Message m1
LEFT JOIN Message m2
ON LEAST(m1.toUserID, m1.userID) = LEAST(m2.toUserID, m2.userID)
AND GREATEST(m1.toUserID, m1.userID) = GREATEST(m2.toUserID, m2.userID)
AND m2.time > m1.Time
WHERE m2.MessageID IS NULL
AND ( (m1.toUserID IN (2,4) AND m1.userID = 1)
OR (m1.userID IN (2,4) AND m1.toUserID = 1)
);
To simplify how this works, imagine you just wanted the latest message sent by userid 1, rather than having to match the to/from tuples as this adds clutter to the query that doesn't help. To get this I would use:
SELECT m1.*
FROM Message AS m1
LEFT JOIN Message AS m2
ON m2.UserID = m1.UserID
AND m2.time > m1.time
WHERE m1.UserID = 1
AND m2.MessageID IS NULL;
So, we are joining similar messages, stipulating that the second message (m2) has a greater time than the first. Where m2 is null, it means there is no similar message with a later time, therefore m1 is the latest message.
Exactly the same principle has been applied in the solution, but we have a more complicated join to link conversations.
I have used LEAST and GREATEST in the join, the theory being that since you have 2 members in your tuple (UserID, ToUserID), then in any combination the greatest and the least will be the same, e.g.:
From/To | Greatest | Least |
--------+-----------+-------+
1, 2 | 2 | 1 |
2, 1 | 2 | 1 |
1, 4 | 4 | 1 |
4, 1 | 4 | 1 |
4, 2 | 4 | 2 |
2, 4 | 4 | 2 |
As you can see, in similar From/To the greatest and the least will be the same, so you can use this to join the table to itself.
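Here is a sketch of the self-join run against the sample data in an in-memory SQLite database. SQLite is only a stand-in for MySQL here: it has no LEAST and GREATEST, so the two-argument scalar forms of MIN and MAX are used instead, which behave the same way for this purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE message (messageID INTEGER PRIMARY KEY, userID INTEGER, "
    "toUserID INTEGER, message TEXT, time INTEGER)"
)
# Sample rows copied from the question.
conn.executemany(
    "INSERT INTO message VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, 2, "nachricht 1", 123),
        (2, 1, 2, "nachricht 2", 124),
        (3, 2, 1, "nachricht 3", 125),
        (4, 3, 2, "nachricht wrong", 1263),
        (5, 2, 4, "nachricht wrong", 1261),
        (6, 4, 1, "nachricht sandra", 126),
    ],
)

# m2 is any later message between the same two users; keep m1 when none exists.
latest_per_pair = sorted(
    row[0]
    for row in conn.execute(
        """
        SELECT m1.messageID
        FROM message m1
        LEFT JOIN message m2
          ON MIN(m1.toUserID, m1.userID) = MIN(m2.toUserID, m2.userID)
         AND MAX(m1.toUserID, m1.userID) = MAX(m2.toUserID, m2.userID)
         AND m2.time > m1.time
        WHERE m2.messageID IS NULL
          AND (   (m1.toUserID IN (2, 4) AND m1.userID = 1)
               OR (m1.userID IN (2, 4) AND m1.toUserID = 1))
        """
    )
)
print(latest_per_pair)
```

Because the join condition wraps the columns in functions, it is unlikely to use an ordinary index on userID or toUserID, so this form is probably better suited to small tables.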
There are two parts of your query in the following order:
You want the latest outgoing or incoming message for a conversation between two users
You want these latest messages for two different pairs of users, i.e. conversations.
So, let's get the latest message for a conversation between UserID a and UserID b:
SELECT *
FROM message
WHERE (toUserID, userID) IN ((a, b), (b, a))
ORDER BY message.time DESC
LIMIT 1
Then you want these to be combined for the two conversations between UserIDs 1 and 2 and UserIDs 1 and 4. This is where the union comes into play (we do not need to check for duplicates, thus we use UNION ALL, thanks to Marcus Adams, who brought that up first).
So a complete and straightforward solution would be:
(SELECT *
FROM message
WHERE (toUserID, userID) IN ((2, 1), (1, 2))
ORDER BY message.time DESC
LIMIT 1)
UNION ALL
(SELECT *
FROM message
WHERE (toUserID, userID) IN ((4, 1), (1, 4))
ORDER BY message.time DESC
LIMIT 1)
And as expected, you get message 3 and 6 in your SQLFiddle.

MySQL compare, count and order by

I am having trouble with understanding how to solve a seemingly simple problem of sorting results.
I want to count how many fruits each user has in common with the user with ID 1, and display the users in descending order of that count, so that whoever has the most matches comes first.
users:
1 jack
2 john
3 jim
fruits:
id, title
1 apple
2 banana
3 orange
4 pear
5 mango
relations: 2 indexes (user_id, fruit_id) and (fruit_id, user_id)
user_id, fruit_id
1 1
1 2
1 5
2 1
2 2
2 4
3 3
3 1
expected results: (comparing with Jack's favourite fruits (user_id=1))
user_id, count
1 3
2 2
3 1
Query:
SELECT user_id, COUNT(*) AS count FROM relations
WHERE fruit_id IN (SELECT fruit_id FROM relations WHERE user_id=1)
GROUP BY user_id
HAVING count>=2
More "optimized" query:
SELECT user_id, COUNT(*) AS count FROM relations r
WHERE EXISTS (SELECT 1 FROM relations WHERE user_id=1 and r.fruit_id=fruit_id)
GROUP BY user_id
HAVING count>=2
2 is the minimum number of matches. (required for the future)
explain:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY r index NULL uid 8 NULL 15 Using where; Using index
2 DEPENDENT SUBQUERY relations eq_ref xox,uid xox 8 r.relations,const 1 Using where; Using index
Everything is working fine, until I try to use ORDER BY count DESC
Then I see: Using temporary; Using filesort
I don't want to use temporary tables or filesort. Because in the future, the database should be under high load.
I know, this is how SQL is defined and how it operates. But I can not figure out how to do it in other way? Without temporary tables and filesort.
I need to show the users who have the most matches first.
Please, help me out.
UPDATE:
I did some tests with the query from Walker Farrow (which still uses a filesort).
20,000 rows - avg 0.05 seconds
120,000 0.20 sec.
1,100,000 2.9 sec.
disappointing results.
It would be possible to change the tables structure, but, with such a counting and sorting - I don't know how.
Are there any suggestions on how this can be done?
Probably the best way to do this would be to create a subquery and then order by in the outer-query, something like this:
select *
from (
SELECT user_id, COUNT(*) AS count FROM relations r
WHERE EXISTS (SELECT 1 FROM relations WHERE user_id=1 and r.fruit_id=fruit_id)
GROUP BY user_id
HAVING count(*)>=2
) x
order by count desc
Also, I don't know why you need to add exists. Can you just say the following:
select *
from (
SELECT user_id, COUNT(*) AS count FROM relations r
WHERE user_id=1
GROUP BY user_id
HAVING count(*)>=2
) x
order by count desc
?
I am not sure, maybe I am missing something. Hope that helps!
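For reference, here is a sketch of the first suggestion (the one that keeps EXISTS) run on the sample data in an in-memory SQLite database, used as a stand-in for MySQL. The alias matches replaces count purely to avoid any confusion with the function name. It only checks that the result is right; it says nothing about whether MySQL still needs a filesort.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE relations (user_id INTEGER, fruit_id INTEGER)")
# The two indexes mentioned in the question; the names are made up.
conn.execute("CREATE INDEX idx_user_fruit ON relations (user_id, fruit_id)")
conn.execute("CREATE INDEX idx_fruit_user ON relations (fruit_id, user_id)")
conn.executemany(
    "INSERT INTO relations VALUES (?, ?)",
    [(1, 1), (1, 2), (1, 5), (2, 1), (2, 2), (2, 4), (3, 3), (3, 1)],
)

# Count, per user, the fruits shared with user 1; keep users with 2+ matches.
ranked = conn.execute(
    """
    SELECT user_id, matches
    FROM (
        SELECT r.user_id, COUNT(*) AS matches
        FROM relations r
        WHERE EXISTS (
            SELECT 1
            FROM relations j
            WHERE j.user_id = 1 AND j.fruit_id = r.fruit_id
        )
        GROUP BY r.user_id
        HAVING COUNT(*) >= 2
    ) x
    ORDER BY matches DESC
    """
).fetchall()
print(ranked)
```

User 3 shares only one fruit with user 1, so the HAVING clause drops that row.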

MySQL select rand and exclude users with condition

I need to select a random user_id from the "user" table, and completely exclude any user_id with whom the current user has an "ongoing" battle (battles.status).
Query:
SELECT user.id
FROM user
LEFT JOIN battles b ON b.uid = user.id AND b.status <> 'ongoing'
WHERE user.id <> 1
ORDER BY RAND( )
LIMIT 1
But the query is not sufficient, because a user can have multiple battles with specific other users, one of them "ongoing" and the others "finished",
My query should select users from the "finished" row.
Tables structure:
user table:
id name
1 John
2 Sarah
3 Jack
4 Andy
5 Rio
battles table:
id uid uid2 status
1 1 2 finished
2 1 2 ongoing
3 2 3 ongoing
4 1 4 finished
5 3 5 finished
If "my" id = "1",
I want to completely exclude any user I have an ongoing battle with, like "2" in the above case, and accept all other ids (i.e. 3, 4 and 5).
You probably want something along the lines of this:
SELECT foe.*
-- Select yourself and join all other users to find potential foes
FROM `user` AS me
INNER JOIN `user` AS foe
ON (me.id <> foe.id)
-- Here we select the active user
WHERE me.`id` = 1
-- Now we exclude foes we have ongoing battles with
-- (your id could be in either uid or uid2)
AND foe.`id` NOT IN (
SELECT `uid` FROM `battles`
WHERE `uid2` = me.`id` AND `status` = 'ongoing'
UNION ALL
SELECT `uid2` FROM `battles`
WHERE `uid` = me.`id` AND `status` = 'ongoing'
);
This will return a list of users which you do not currently have ongoing battles with. You can customise this to return just one of them using LIMIT and random ordering like in your example.
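A simplified sketch of the same idea, run on the sample data in an in-memory SQLite database as a stand-in for MySQL. It passes the current user's id as a parameter (the name me is an assumption) instead of self-joining the user table, and it picks the random opponent in Python rather than with ORDER BY RAND().

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "user" (id INTEGER PRIMARY KEY, name TEXT)')
conn.execute(
    "CREATE TABLE battles (id INTEGER PRIMARY KEY, uid INTEGER, "
    "uid2 INTEGER, status TEXT)"
)
conn.executemany(
    'INSERT INTO "user" VALUES (?, ?)',
    [(1, "John"), (2, "Sarah"), (3, "Jack"), (4, "Andy"), (5, "Rio")],
)
conn.executemany(
    "INSERT INTO battles VALUES (?, ?, ?, ?)",
    [
        (1, 1, 2, "finished"),
        (2, 1, 2, "ongoing"),
        (3, 2, 3, "ongoing"),
        (4, 1, 4, "finished"),
        (5, 3, 5, "finished"),
    ],
)

# Everyone except me and except users I share an ongoing battle with,
# whichever of uid / uid2 holds my id.
candidates = [
    row[0]
    for row in conn.execute(
        """
        SELECT id
        FROM "user"
        WHERE id <> :me
          AND id NOT IN (
              SELECT uid FROM battles
              WHERE uid2 = :me AND status = 'ongoing'
              UNION ALL
              SELECT uid2 FROM battles
              WHERE uid = :me AND status = 'ongoing'
          )
        ORDER BY id
        """,
        {"me": 1},
    )
]
chosen = random.choice(candidates)
print(candidates, chosen)
```

User 2 is excluded because of the ongoing battle, even though a finished battle with the same user also exists.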

Mysql combine rows?

I have a table with itemid|fieldid|value and I'm trying to set up a query that will combine some data and return a match percentage along with the result. For example, some data could be
itemid fieldid value
19 193 1
45 193 1
37 201 6
25 201 1
45 201 6
19 201 6
19 201 5
Now i want for example, to get all the rows with fieldid = 193 AND value = 1 as well as the rows with fieldid = 201 AND value = 6. The ideal result would be something like :
itemid, percentage getting 100% for all itemids which match both conditions and 50% for all that match one. I have this query working for doing the above over multiple columns but it will not work here
select id,user_class,admin, (
if (admin = 1,1,0)+
if (user_class = 'SA',1,0)
)/2*100 as the_percent
from users
WHERE
admin = 1 OR user_class = 'P'
GROUP BY id
order by the_percent DESC
Also i got the following for absolute matching
SELECT users.id FROM users WHERE
users.id IN
(
SELECT DISTINCT itemid FROM extra_field_values
INNER JOIN (SELECT DISTINCT itemid FROM extra_field_values WHERE fieldid = 201 AND value = 6 ) a1 USING (itemid)
INNER JOIN (SELECT DISTINCT itemid FROM extra_field_values WHERE fieldid = 193 AND value = 1 ) a2 USING (itemid)
)
but combining the two seems to be a bit of a puzzle to me
I think you might be able to make use of a UNION and selecting from a table subquery to make this happen. Perhaps something along the lines of:
SELECT itemid, count(*)/2*100 AS percent FROM
( SELECT itemid FROM extra_field_values WHERE fieldid = 201 AND value = 6
UNION ALL
SELECT itemid FROM extra_field_values WHERE fieldid = 193 AND value = 1 ) AS t
GROUP BY itemid;
It's been a while since I've done anything complex in mysql, and I threw this together in notepad so my syntax could be off :) But basically we create a view of the matching ids, then from that table create our statistics. (You'll also want to do some performance evaluation as well to see how it stacks up compared to doing multiple queries as well).
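Here is a sketch of the UNION ALL approach run against the sample rows in an in-memory SQLite database, used as a stand-in for MySQL. One difference worth flagging: SQLite divides integers as integers, so the percentage is written as COUNT(*) * 100.0 / 2 here, whereas count(*)/2*100 already yields a decimal in MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE extra_field_values (itemid INTEGER, fieldid INTEGER, value INTEGER)"
)
# Sample rows copied from the question.
conn.executemany(
    "INSERT INTO extra_field_values VALUES (?, ?, ?)",
    [
        (19, 193, 1),
        (45, 193, 1),
        (37, 201, 6),
        (25, 201, 1),
        (45, 201, 6),
        (19, 201, 6),
        (19, 201, 5),
    ],
)

# Each branch contributes one row per item matching that condition;
# counting rows per item gives the number of conditions matched (out of 2).
percent_by_item = dict(
    conn.execute(
        """
        SELECT itemid, COUNT(*) * 100.0 / 2 AS percent
        FROM (
            SELECT itemid FROM extra_field_values WHERE fieldid = 201 AND value = 6
            UNION ALL
            SELECT itemid FROM extra_field_values WHERE fieldid = 193 AND value = 1
        ) AS t
        GROUP BY itemid
        """
    ).fetchall()
)
print(percent_by_item)
```

Item 25 matches neither condition and therefore does not appear at all. If an item could have duplicate rows for the same fieldid and value, each branch would probably need SELECT DISTINCT to keep the percentage at or below 100.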