How to rewrite UNION with LEFT JOIN more efficiently - mysql

I have two tables: one that registers users and one that checks in users. A user will always have a single entry in the register table, but may have zero or more entries in the checkin table. For a raffle selector, I wrote a query that picks 1 entry from the register table and 1 entry from the checkin table; each subquery picks a random entry so long as that userID does not exist in a 3rd table that stores the raffle winners. After the two entries are returned, it randomly selects one of them as the winner.
However, I believe there should be a more efficient way of writing this so it ONLY picks an entry once, rather than picking two entries and then picking one of the two.
It took me quite a while to figure out how to correctly write the query below, as I am not proficient in MySQL at all. The query works and seems to work efficiently, but I believe there should be a better way of writing it that also reduces the amount of query code.
Hoping someone here can help or advise.
Table note: clubusers/clubHistory have multiple overlapping columns but the tables are not the same:
register = clubUsers
checkins = clubHistory
winners = clubRaffleWinners
SELECT * FROM (
(SELECT ch.user_ID,ch.clID FROM clubHistory AS ch
LEFT OUTER JOIN clubRaffleWinners AS cr1 ON
ch.user_ID=cr1.user_ID
AND cr1.cID=1157
AND cr1.rafID=18
AND cr1.crID=1001
AND cr1.ceID=1167
AND cr1.chDate1='2022-06-04'
WHERE
ch.cID=1157
AND ch.crID=1001
AND ch.ceID=1167
AND ch.chDate='2022-06-04'
AND cr1.user_ID IS NULL
GROUP BY ch.user_ID ORDER BY RAND() LIMIT 1
)
UNION
(SELECT cu.user_ID,cu.clID FROM clubUsers AS cu
LEFT OUTER JOIN clubRaffleWinners AS cr2 ON
cu.user_ID=cr2.user_ID
AND cr2.cID=1157
AND cr2.rafID=18
AND cr2.crID=1001
AND cr2.ceID=1167
AND cr2.chDate1='2022-06-04'
WHERE
cu.cID=1157
AND cu.crID=1001
AND cu.ceID=1167
AND cu.calDate<='2022-06-04'
AND cr2.user_ID IS NULL
GROUP BY cu.user_ID ORDER BY RAND() LIMIT 1
)
) AS foo order by RAND() LIMIT 1 ;
UPDATE:
As @JettoMartinez points out below, my current query could in fact randomly return the same user from each table, so the final returned entry would just be that same user. I didn't realize this in my struggles just to get the above query to work. Thus my original request for a more optimized query that simply selects a single random entry from both tables (where that user is not already in the winners table) is applicable for yet another reason.

There are two ways I can think of (Do note that since I don't fully understand the tables, I'm not using all the conditions you used in your JOIN statements, meaning it might need more work):
Using an exclusive subquery:
SELECT
cu.user_ID,
cu.clID,
ch.cID
FROM
clubUsers cu
LEFT JOIN clubHistory ch ON ch.user_ID = cu.user_ID
WHERE cu.user_ID NOT IN ( -- qualified: user_ID exists in both cu and ch
SELECT
user_ID
FROM
clubRaffleWinners
WHERE
-- other conditions
)
ORDER BY RAND() LIMIT 1;
Using a LEFT "OUTER" JOIN, as you asked for:
SELECT
cu.user_ID,
cu.clID,
ch.cID -- Or any relevant field from clubHistory, really
FROM
clubUsers cu
LEFT JOIN clubHistory ch ON ch.user_ID = cu.user_ID
LEFT JOIN clubRaffleWinners cr ON cr.user_ID = cu.user_ID
AND ... -- other conditions to ensure uniqueness
AND ... -- that could also be in the WHERE part
WHERE
cr.user_ID IS NULL -- this will filter out the INNER part of the JOIN
ORDER BY RAND() LIMIT 1;
I don't have a dataset to properly test these queries, so please take them as a concept. I also didn't query clubHistory since I honestly don't see the point of doing so. Joining clubRaffleWinners to clubUsers seems enough to me.
EDIT
Since the user_ID in clubHistory is relevant to the raffle, I added a LEFT JOIN to it and added a field from said table in the SELECT statement, so that the user_ID repeats once per entry in clubHistory. Note that a LEFT JOIN does not add an extra row for the clubUsers record itself: a user with no entries appears once, and a user with n entries appears n times, so each user's chance of winning is that row count divided by the total number of rows left after the winners are excluded.
This logic can be applied to the first query with a subquery too, and if the added field needs to be out, the query could be wrapped in a CTE or a subquery.
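To make the two exclusion styles above concrete, here is a minimal sketch that runs both against a toy dataset. It uses Python's built-in sqlite3 as a stand-in for MySQL, and the three-user sample data and reduced column set are assumptions made purely for illustration. The `user_ID IS NOT NULL` guard in the subquery is there because NOT IN returns no rows at all if the subquery yields a NULL.

```python
import sqlite3

# Toy schema and data (hypothetical; only the columns needed for the demo).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clubUsers (user_ID INTEGER PRIMARY KEY, clID INTEGER);
    CREATE TABLE clubRaffleWinners (user_ID INTEGER, rafID INTEGER);
    INSERT INTO clubUsers VALUES (1, 10), (2, 10), (3, 10);
    INSERT INTO clubRaffleWinners VALUES (2, 18);  -- user 2 already won raffle 18
""")

# Form 1: exclusion through a NOT IN subquery.
not_in_rows = conn.execute("""
    SELECT cu.user_ID
    FROM clubUsers cu
    WHERE cu.user_ID NOT IN (
        SELECT user_ID FROM clubRaffleWinners
        WHERE rafID = 18 AND user_ID IS NOT NULL
    )
    ORDER BY cu.user_ID
""").fetchall()

# Form 2: exclusion through LEFT JOIN ... IS NULL.
left_join_rows = conn.execute("""
    SELECT cu.user_ID
    FROM clubUsers cu
    LEFT JOIN clubRaffleWinners cr
           ON cr.user_ID = cu.user_ID AND cr.rafID = 18
    WHERE cr.user_ID IS NULL
    ORDER BY cu.user_ID
""").fetchall()

print(not_in_rows, left_join_rows)
```

Both forms return the same candidates on this data; which one MySQL executes faster depends on the version and the indexes, so it is worth checking with EXPLAIN on the real tables.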

From what you are describing, I want to make sure I understand.
Every registered person qualifies for 1 entry.
However, each time they have checked in, they get 1 entry per check-in. So someone who registered and has NEVER checked in gets 1 entry. But someone who registered and checked in 3 times gets a total of JUST the 3 entries for the check-ins, not 4 (3 plus 1 for being registered).
Regardless of who is eligible, you want to EXCLUDE everyone who has already been a winner in the raffle.
You SHOULD be able to get results from the query below. Since the tables appear to share the same filtering columns (cID, crID, ceID and the date), I based the primary FROM on the registered clubUsers.
From that, a LEFT JOIN to clubHistory returns that person's ID once if they are only registered, OR multiple times based on the number of check-ins, as in the example above.
From the given user, I am also left-joining directly to the raffle winners on the same criteria as the clubHistory join, plus rafID = 18, which appears to identify the specific raffle being drawn. Whether the user contributes a single entry or multiple entries, the IS NULL test in the final WHERE excludes anyone found in the winners table.
The query returns all entries, single or multiple, for users who have not already won, in ORDER BY RAND() order, and applies a single LIMIT 1 to get the final winner. I don't know why you needed what appeared to be the clubhouse ID when you only really care about WHO won, regardless of whether the entry came from the club history.
SELECT
cu.user_ID
FROM
clubUsers AS cu
LEFT JOIN clubHistory ch
on cu.user_ID = ch.user_ID
AND cu.cID = ch.cID
AND cu.crID = ch.crID
AND cu.ceID = ch.ceID
AND ch.chDate = '2022-06-04'
LEFT JOIN clubRaffleWinners AS crw
ON cu.user_ID = crw.user_ID
AND cu.cID = crw.cID
AND cu.crID = crw.crID
AND cu.ceID = crw.ceID
AND crw.chDate1 = '2022-06-04'
AND crw.rafID = 18
WHERE
cu.cID = 1157
AND cu.crID = 1001
AND cu.ceID = 1167
AND cu.calDate <= '2022-06-04'
AND crw.user_id IS NULL
order by
RAND()
LIMIT 1
For performance purposes, I would ensure the following indexes:
table              index
clubUsers          ( cid, crID, ceID, calDate, user_id )
clubHistory        ( user_id, cID, crID, ceID, chDate )
clubRaffleWinners  ( user_id, cID, crID, ceID, chDate1, rafID )
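As a rough check of the entry counts described in this answer, the sketch below runs a trimmed-down version of the query on toy data. Everything about the data is an assumption for illustration: user 1 never checked in, user 2 checked in three times, and user 3 checked in once but already won. It uses Python's sqlite3 as a stand-in, so RANDOM() takes the place of MySQL's RAND(), and the crID/ceID columns are left out for brevity.

```python
import sqlite3

# Hypothetical sample data; only a subset of the real columns is modelled.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clubUsers (user_ID INTEGER, cID INTEGER, calDate TEXT);
    CREATE TABLE clubHistory (user_ID INTEGER, cID INTEGER, chDate TEXT);
    CREATE TABLE clubRaffleWinners (user_ID INTEGER, cID INTEGER, rafID INTEGER);
    INSERT INTO clubUsers VALUES
        (1, 1157, '2022-06-01'), (2, 1157, '2022-06-01'), (3, 1157, '2022-06-01');
    INSERT INTO clubHistory VALUES
        (2, 1157, '2022-06-04'), (2, 1157, '2022-06-04'), (2, 1157, '2022-06-04'),
        (3, 1157, '2022-06-04');
    INSERT INTO clubRaffleWinners VALUES (3, 1157, 18);
""")

# One row per raffle entry: check-ins multiply rows, winners are anti-joined away.
CANDIDATES = """
    SELECT cu.user_ID
    FROM clubUsers AS cu
    LEFT JOIN clubHistory AS ch
           ON ch.user_ID = cu.user_ID AND ch.cID = cu.cID AND ch.chDate = '2022-06-04'
    LEFT JOIN clubRaffleWinners AS crw
           ON crw.user_ID = cu.user_ID AND crw.cID = cu.cID AND crw.rafID = 18
    WHERE cu.cID = 1157
      AND cu.calDate <= '2022-06-04'
      AND crw.user_ID IS NULL
"""

entries = sorted(row[0] for row in conn.execute(CANDIDATES))
winner = conn.execute(CANDIDATES + " ORDER BY RANDOM() LIMIT 1").fetchone()[0]
print(entries, winner)
```

On this data the candidate list is one row for user 1 and three rows for user 2, so user 2 is three times as likely to be drawn, and user 3 can never be picked.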

(Just a Comment, but need formatting.)
I would start by trying to put these 4 values in a single table, not repeated across 3 tables:
cu.cID=1157
AND cu.crID=1001
AND cu.ceID=1167
AND cu.calDate<='2022-06-04'
Please provide SHOW CREATE TABLE for each table; then I can assess whether the recommended indexes make sense.

Related

Does JOIN or LEFT JOIN keep checking in a SELECT query?

I have a JOIN query but I need to optimize it for performance.
For example, in this query:
"SELECT id FROM users WHERE id = :id"
Since there is no LIMIT 1 at the end of the query, that select query will keep searching. If I add LIMIT 1 to the end of that query, it will select only one and stop searching for more.
Here is my question and query:
"SELECT messages.text, users.name
FROM messages
LEFT JOIN users
ON messages.from_id = users.id
WHERE messages.user_id = :user_id"
In the JOIN users ON messages.from_id = users.id part, since there is only 1 user with that ID, will it keep searching after it has found that query? If it does, how can I optimize it so that it only searches for 1 row?
SELECT id FROM users WHERE id = :id
If there is no index on id, the entire table is scanned.
If there is a UNIQUE or PRIMARY KEY on id, only one row will be checked.
If there is a plain INDEX, it will scan from the first match until it finds an id that does not match.
For this:
SELECT m.text, u.name
FROM messages AS m
LEFT JOIN users AS u ON m.from_id = u.id
WHERE m.user_id = :user_id
It will do a "Nested Loop Join":
Find the occurrence(s) in messages that satisfy m.user_id = :user_id (see above).
For each such row, reach into users based on the ON clause.
There may be multiple rows (again, depending on the index or lack of such).
So, your question "how can I optimize it so that it only searches for 1 row" is answered:
If there can only be one row, declare it UNIQUE.
If there are sometimes more than one, then INDEX. But don't worry about checking for an extra row; it is not that costly.
You say "only 1 user with that ID", but fail to specify which id in which table.
But that is not the end of the story...
LEFT JOIN may get turned into JOIN. In that case, users may be the first table to look at. Note also that the Optimizer is smart enough to deduce that you want u.id = :user_id. Anyway, the NLJ will start with users, then reach into messages. Again, the types of indexes are important.
Please provide SHOW CREATE TABLE for both tables. Then I can condense the answer to the relevant parts. Please provide EXPLAIN SELECT ... for confirmation of what I am saying.
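The difference between "no index" and "UNIQUE index" can be seen in a query plan. The sketch below is only an illustration using Python's built-in sqlite3 and its EXPLAIN QUERY PLAN statement; MySQL's EXPLAIN output looks different, and the table and index names here (users_plain, users_unique, ux_users_unique_id) are made up for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users_plain (id INTEGER, name TEXT);           -- no index on id
    CREATE TABLE users_unique (id INTEGER, name TEXT);
    CREATE UNIQUE INDEX ux_users_unique_id ON users_unique (id);
""")

def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, (1,)).fetchall()
    return " | ".join(row[3] for row in rows)

plain_plan = plan("SELECT id FROM users_plain WHERE id = ?")
unique_plan = plan("SELECT id FROM users_unique WHERE id = ?")
print(plain_plan)   # a full-table SCAN
print(unique_plan)  # a SEARCH through the unique index
```

The same experiment on the real tables would be `EXPLAIN SELECT ...` in MySQL, which is why the answer asks for that output.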

Running a check within a SQL query (Maybe a subquery?)

I have a simple laptop testing booking system with 6 laptops, named Laptop01 to 06 that each have three allocated time slots.
A user is able to select these time slots if the slot is not booked or if the booking has been cancelled/declined.
While I have working code, I've realised a fatal error that causes a cancelled/declined slot to duplicate.
Let me explain...
event_information - Holds the booking event information (only ID is needed for this example)
event_machine_time - This holds all the laptops, with three rows per laptop with the unique timings available to choose from
event_booking - This holds the actual booking, which then links to another candidate database, not included here
I then run a simple query that joins everything together and (I thought) identifies the booked events:
SELECT machine_laptop, machine_name, B.id AS m_id, C.id AS c_id, C.confirmed AS c_confirmed, C.live AS c_live,
(C.id IS NOT NULL AND C.confirmed !=2 AND C.live !=0) AS booked
FROM event_information A
INNER JOIN event_machine_time B ON ( 1 =1 )
LEFT JOIN event_booking C on (B.id = C.machine_time_id and A.id = C.information_id )
WHERE A.id = :id
ORDER BY `B`.`id` DESC
booked is checking if confirmed isn't 2 - which means the booking has been cancelled/declined (0 - not confirmed, 1 - confirmed) and live is checking for deletion (0 - deleted, 1 - not deleted).
However if a person either gets deleted (live - 0) or cancels/declines (confirmed - 2) then in my front end slot selector dropdown it will add an extra slot as the booked column is still 0, as shown below:
This allows the user to then choose from two slots at the same time, meaning double bookings occur.
I now know that using a Join is the wrong thing to do, and I'm presuming that I need to run a subquery, but I'm not an SQL expert and I would love some help to find examples of similar 'second queries' that I can learn from.
Also apologies if my terminology is wrong.
EDIT:
As requested I've included the output:
Second edit and conclusion:
In the end I managed to craft a solution together using a sub query to remove the cancelled/declined bookings before the output, then use a Group By to only display one of each timing. This most likely isn't the best way, but it worked for me.
SELECT machine_laptop, machine_name, B.id AS m_id, C.id AS c_id, C.confirmed AS c_confirmed, C.live AS c_live, B.start_time AS b_start_time, (
C.id IS NOT NULL
AND C.confirmed !=2
AND C.live !=0
) AS booked
FROM event_information A
INNER JOIN event_machine_time B ON (1=1)
LEFT JOIN (SELECT * FROM event_booking WHERE confirmed <> '2' AND live <> '0') AS C ON ( B.id = C.machine_time_id AND A.id = C.information_id )
WHERE A.id = :id
GROUP BY m_id
ORDER BY machine_name ASC, b_start_time ASC
Thank you for all your input.
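For comparison, the same filtering can also be expressed by putting the cancelled/deleted conditions directly in the ON clause of the LEFT JOIN instead of in a derived table. The following is only a sketch under assumptions: it uses Python's sqlite3 rather than MySQL, it leaves out event_information and passes the event id as a parameter, and the sample bookings are invented (slot 1 has one cancelled and one confirmed booking, slot 2 only a cancelled one, slot 3 none).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE event_machine_time (id INTEGER PRIMARY KEY, machine_name TEXT);
    CREATE TABLE event_booking (id INTEGER PRIMARY KEY, machine_time_id INTEGER,
                                information_id INTEGER, confirmed INTEGER, live INTEGER);
    INSERT INTO event_machine_time VALUES (1, 'Laptop01'), (2, 'Laptop01'), (3, 'Laptop01');
    -- confirmed = 2 means cancelled/declined; live = 0 means deleted
    INSERT INTO event_booking VALUES (1, 1, 7, 2, 1), (2, 1, 7, 1, 1), (3, 2, 7, 2, 1);
""")

# Cancelled or deleted bookings never match the join, so they cannot add rows.
slots = conn.execute("""
    SELECT B.id AS m_id, (C.id IS NOT NULL) AS booked
    FROM event_machine_time B
    LEFT JOIN event_booking C
           ON C.machine_time_id = B.id
          AND C.information_id = ?
          AND C.confirmed <> 2
          AND C.live <> 0
    ORDER BY B.id
""", (7,)).fetchall()
print(slots)
```

Each slot comes back exactly once as long as a slot has at most one active booking per event, which would make the GROUP BY unnecessary in that case.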
Try below :
SELECT machine_laptop, machine_name, B.id AS m_id, C.id AS c_id, C.confirmed
AS c_confirmed, C.live AS c_live,
(C.id IS NOT NULL AND C.confirmed !=2 AND C.live !=0) AS booked
FROM event_information A
LEFT JOIN event_booking C ON A.id = C.information_id
RIGHT JOIN event_machine_time B ON B.id = C.machine_time_id
WHERE A.id = :id
ORDER BY `B`.`id` DESC
If you make the event_booking (B) the starting point for your query, you can see that there's no need to pull all rows and columns from A and C. Instead you can join on matching rows directly. But as I can't even properly grasp what your query is trying to achieve, I have a couple of questions first:
While this may work it's actually something that's not under your control nor defined by you. Some more strict mode would politely tell you to specify which aliased table you're referring to in your SELECT, as this
SELECT machine_laptop, machine_name -- combined with
FROM event_information A
actually doesn't make sense, and the only reason it works is that you're leveraging MySQL's optimisations. In addition to that, you're trying to do table joins in a mixed mode (meaning that you use both JOIN and WHERE tA.colX=tB.colY methods). This makes it really difficult to follow.
INNER JOIN event_machine_time B ON ( 1 =1 )
Um? What exactly is the purpose of this? As far as I can tell this will only cause it to JOIN both full tables, only to later filter the result using WHERE.
Furthermore, are you even using primary keys? Your condition includes C.id IS NOT NULL, while primary keys can't even contain NULLs (NULL is the third boolean state in SQL land: there is True, False, and Null, meaning Undefined, which obviously couldn't be used in a primary key, as a primary key must be unique and an Undefined value can be anything or nothing, so it violates the uniqueness requirement). So I'm assuming you're actually using this NULL check because the temp table during the JOIN seems to contain them?
EDIT:
Try to split this into two parts, where you first join 2 tables, and then join third table with the result.
I suggest you go briefly over What is the difference between "INNER JOIN" and "OUTER JOIN"? - as this is pretty great post and clarifies many aspects.
For starters I'd go with something like:
SELECT
<i.cols>,
<b.cols>,
<mt.cols>,
IF(b.confirmed !=2 AND b.live !=0, True, False) AS booked
FROM
event_booking b
LEFT JOIN
event_information i ON b.information_id = i.id
LEFT JOIN
event_machine_time mt ON b.machine_time_id = mt.id
WHERE <conditions>
Later I'd change LEFT JOIN into something more appropriate. However, bear in mind that INNER JOIN is only useful if you're 100% sure that the rows returned from the joined table are unique.
Can there even be 1:n, n:1 relationship between i and b tables? I'd assume there couldn't be multiple bookings to same event info (n:1), nor there'd be so that event information is the same for multiple events ? (1:n)

Querying a large table using mysql

I manage a property website. I have a table with banned users (small table) and a table called advert_views which keeps track of each listing that each user views (currently 1.3m rows and growing). The advert_views table also records the IP address for every advert viewed.
I want to get the IP addresses used by the banned users and check if any of these banned users have opened new accounts. I ran the following query:
SELECT adviews.user_id AS 'banned user_id',
adviews.client_ip AS 'IPs used by banned users',
adviews2.user_id AS 'banned users that opened a new account'
FROM banned_users
LEFT JOIN users on users.email_address = banned_users.email_address #since I don't store the user_id in banned_users
LEFT JOIN advert_views adviews ON adviews.user_id = users.id AND adviews.user_id IS NOT NULL # users may view listings when not logged in but they have restricted access to the information on the listing
LEFT JOIN (SELECT client_ip,
user_id
FROM advert_views
WHERE user_id IS NOT NULL
) adviews2
ON adviews2.client_ip = adviews.client_ip
WHERE banned_users.rec_status = 1 and adviews.user_id <> adviews2.user_id
GROUP BY adviews2.user_id
I applied an index on the advert_views table and the users table as per below:
[image: index definitions on advert_views and users]
My query takes half an hour to execute. Is there a way to improve my query speed?
Thanks!
Chris
First of all: Why do you outer join the tables? Or better: Why do you try to outer join the tables? A left join is meant to get data from a table even when there is no match. But then your results could contain rows with all values null. (That doesn't happen though, because adviews.user_id <> adviews2.user_id in your where clause dismisses all outer-joined rows.) Don't give the DBMS more work to do than necessary. If you want inner joins, then don't outer join. (Though the difference in execution time won't be huge.)
Next: You select from banned_users, but you only use it to check existence. You shouldn't do this. Use an EXISTS or IN clause instead. (This is mainly for readability and in order not to produce duplicate results. This probably won't speed things up.)
SELECT av1.user_id AS 'banned user_id',
av2.client_ip AS 'IPs used by banned users',
av2.user_id AS 'banned users that opened a new account'
FROM advert_views av1
JOIN advert_views av2 ON av2.client_ip = av1.client_ip AND av2.user_id <> av1.user_id
WHERE av1.user_id IN
(
SELECT id -- the users table is joined on users.id in the question
FROM users
WHERE email_address IN (select email_address from banned_users where rec_status = 1)
)
GROUP BY av2.user_id;
You may replace the inner IN clause with a join. It's mostly a matter of personal preference, but it is also that in the past MySQL sometimes didn't perform well on IN clauses, so many people made it a habit to join instead.
WHERE av1.user_id IN
(
SELECT u.id
FROM users u
JOIN banned_users bu ON bu.email_address = u.email_address
WHERE bu.rec_status = 1
)
Finally, consider removing the GROUP BY clause. It reduces your results to one row per reusing user_id, showing one of its related banned user_ids (arbitrarily chosen in case there is more than one). I don't know your tables. Are you getting many records per reusing user_id? If not, remove the clause.
As to indexes I suggest:
banned_users(rec_status, email_address)
users(email_address, id)
advert_views(user_id, client_ip)
advert_views(client_ip, user_id)
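Here is a small runnable sketch of the self-join idea from this answer, using Python's sqlite3 as a stand-in for MySQL. The sample rows are invented, and it assumes the question's schema (the table is advert_views and the users key is users.id). DISTINCT is used instead of GROUP BY so that every banned/new pair is kept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email_address TEXT);
    CREATE TABLE banned_users (email_address TEXT, rec_status INTEGER);
    CREATE TABLE advert_views (user_id INTEGER, client_ip TEXT);
    INSERT INTO users VALUES
        (1, 'banned@example.com'), (2, 'new@example.com'), (3, 'other@example.com');
    INSERT INTO banned_users VALUES ('banned@example.com', 1);
    -- user 2 shares an IP with banned user 1; the NULL row is an anonymous view
    INSERT INTO advert_views VALUES
        (1, '10.0.0.1'), (2, '10.0.0.1'), (3, '10.0.0.9'), (NULL, '10.0.0.1');
""")

# Pair every view by a banned user with views from the same IP by a different user.
matches = conn.execute("""
    SELECT DISTINCT av1.user_id, av2.client_ip, av2.user_id
    FROM advert_views av1
    JOIN advert_views av2
      ON av2.client_ip = av1.client_ip AND av2.user_id <> av1.user_id
    WHERE av1.user_id IN (
        SELECT u.id
        FROM users u
        JOIN banned_users bu ON bu.email_address = u.email_address
        WHERE bu.rec_status = 1
    )
""").fetchall()
print(matches)
```

The anonymous view drops out on its own because comparing a NULL user_id with `<>` never evaluates to true, so no separate `IS NOT NULL` filter is needed in this form.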

Join another table with multiple rows to another table's single result

I currently select a single row (a post):
SELECT s.id AS id,s.date,s.title,s.views,s.image,s.width,s.description,u.id AS userId,u.username,u.display_name,u.avatar,
(select count(*) from comments where item_id = s.id and type = 1) as numComments,
(select count(*) from likes where item_id = s.id and type = 1) as numLikes,
(select avg(value) from ratings where showcase_id = s.id) as average,
(select count(*) from ratings where showcase_id = s.id) as total
FROM showcase AS s
INNER JOIN users AS u ON s.user_id = u.id
WHERE s.id = :id
LIMIT 5
Then get comments for that post in a separate query:
SELECT c.id as c_id,c.text,c.date,u.id as u_id,u.username,u.display_name,u.avatar
FROM comments as c
INNER JOIN users as u ON c.user_id = u.id
WHERE item_id = :item_id AND type = :type
:id and :item_id are the same. However, the comments return multiple rows whereas the first query returns one row - is there a way to join the comments to the first query or is the current way fine?
It really depends on your application.
If we are talking about a few records returned from a small or medium table, and if the query is executed just a few times a day, then it wouldn't matter much if:
you work with two record sets (two different queries are executed and then their results are put together);
you join the two queries, copying the post information for each record from the comments query;
you build an XML with the comments and join it to the record returned in the first query (the post record).
Another factor to take into consideration is whether the post and its comments are displayed at the same time. If this is NOT the case and the comments are not visible at first and displayed only after some action like the click of a button, then you should choose the 1st option above, for performance reasons.
But if both the post information and its comments must be displayed at the same time, then you should choose one of the 3 options above. Which one is more a matter of personal preference in modeling your application data structures and its database access layer.
Now, if the volume of data may get huge, then you should dig a little deeper and run some simulations to find the query(ies) that give you the optimal performance.

join on sub query returns fails

Trying to join a table "fab_quota.quotatype" to a value inside a sub query "fab_status_members.statustype" but it returns nothing.
If I join the 2 tables directly in a query the result is correct.
Like this:
select statustype, takst
from
fab_status_members AS sm
join fab_quota as fq
ON fq.quotatype = sm.statustype
So I must be doing something wrong, here the sub query code, any help appreciated
select
ju.id,
name,
statustype,
takst
from jos_users AS ju
join
( SELECT sm.Members AS MemberId, MaxDate , st.statustype
FROM fab_status_type AS st
JOIN fab_status_members AS sm
ON (st.id = sm.statustype) -- tables are joined
JOIN
( SELECT members, MAX(pr_dato) AS MaxDate -- choose members and Maxdate from
FROM fab_status_members
WHERE pr_dato <= '2011-07-01'
GROUP BY members
)
AS sq
ON (sm.members = sq.members AND sm.pr_dato = sq.MaxDate)
) as TT
ON ju.id = TT.Memberid
join fab_quota as fq
ON fq.quotatype = TT.statustype
GROUP BY id
Guess the problem is in the line: join fab_quota as fq ON fq.quotatype = TT.statustype
But I can't seem to look through it :-(
Best regards
Thomas
It looks like you are joining down to the lowest combination of per member with their respective maximum pr_dato value for given date. I would pull THIS to the FIRST query position instead of being buried, then re-join it to the rest...
select STRAIGHT_JOIN
ju.id,
ju.name,
fst.statustype,
takst
from
( SELECT
members,
MAX(pr_dato) AS MaxDate
FROM
fab_status_members
WHERE
pr_dato <= '2011-07-01'
GROUP BY
members ) MaxDatePerMember
JOIN jos_users ju
on MaxDatePerMember.members = ju.ID
JOIN fab_status_members fsm
on MaxDatePerMember.members = fsm.members
AND MaxDatePerMember.MaxDate = fsm.pr_dato
JOIN fab_status_type fst
on fsm.statustype = fst.id
JOIN fab_quota as fq
on fst.statusType = fq.quotaType
I THINK I have all of what you want, and let me reiterate in simple words what I think you want. Each member can have multiple status entries (via Fab_Status_Members). You are looking for all members and what their MOST RECENT Status is as of a particular date. This is the first query.
From that, whatever users qualify, I'm joining to the user table to get their name info (first join).
Now, back to the complex part. From the first query that determined the most recent date status activity, re-join back to that same table (fab_status_members) and get the actual status code SPECIFIC to the last status date for that member (second join).
From the result of getting the correct STATUS per Member on the max date, you need to get the TYPE of status that code represented (third join to fab_status_type).
And finally, from knowing the fab_status_type, what is its quota type.
You shouldn't need the group by since the first query is grouped by the members ID and will return a single entry per person (UNLESS... it's possible to have multiple status types on the same day in the fab_status_members table... if that is a full date/time field, then you are ok).
I'm not sure which table the "takst" column comes from. I try to fully qualify columns with the table names (or aliases) they come from, but my guess is it's coming from the QuotaType table.
... EDIT from comment...
Sorry, yeah, FQ for the last join. As for it not returning any rows, I would try the joins one at a time and see where the break is: how many rows come from the maxdate query, then add the join to users to make sure the same record count is returned. Then add the FSM (re-join) for specific member / date activity, THEN the status type... somewhere along the chain it's missing, and the only thing I can think of is a miss on the status type, as any member status would have to be associated with one of the users, and it should find its way back to itself as that's where the max date originated from. I'm GUESSING it's somewhere on the join to the status type or the quota.
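The "add one join at a time and watch the row count" approach can be scripted. The sketch below uses Python's sqlite3 with invented sample rows. It also assumes, based only on the asker's working two-table query (fq.quotatype = sm.statustype), that fab_quota.quotatype holds the same numeric codes as fab_status_members.statustype rather than the type name; the real schema may differ.

```python
import sqlite3

# Hypothetical schema and data, guessed from the queries in this thread.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jos_users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fab_status_type (id INTEGER PRIMARY KEY, statustype TEXT);
    CREATE TABLE fab_status_members (members INTEGER, statustype INTEGER, pr_dato TEXT);
    CREATE TABLE fab_quota (quotatype INTEGER, takst INTEGER);
    INSERT INTO jos_users VALUES (1, 'Anna'), (2, 'Bo');
    INSERT INTO fab_status_type VALUES (1, 'junior'), (2, 'senior');
    INSERT INTO fab_status_members VALUES
        (1, 1, '2010-01-01'), (1, 2, '2011-06-01'), (2, 1, '2011-08-01');
    INSERT INTO fab_quota VALUES (1, 100), (2, 200);
""")

# Step 0: most recent status date per member as of the cutoff.
MAX_DATE = """
    (SELECT members, MAX(pr_dato) AS MaxDate
     FROM fab_status_members
     WHERE pr_dato <= '2011-07-01'
     GROUP BY members) AS md
"""
JOINS = [
    "JOIN jos_users ju ON ju.id = md.members",
    "JOIN fab_status_members fsm ON fsm.members = md.members AND fsm.pr_dato = md.MaxDate",
    "JOIN fab_status_type fst ON fst.id = fsm.statustype",
    "JOIN fab_quota fq ON fq.quotatype = fst.statustype",
]

# Add one join at a time and record the row count after each step.
counts = [
    conn.execute(f"SELECT COUNT(*) FROM {MAX_DATE} " + " ".join(JOINS[:n])).fetchone()[0]
    for n in range(len(JOINS) + 1)
]

# Same chain, but the quota join uses the numeric code from fab_status_members.
fixed_joins = JOINS[:3] + ["JOIN fab_quota fq ON fq.quotatype = fsm.statustype"]
result = conn.execute(
    f"SELECT ju.id, ju.name, fst.statustype, fq.takst FROM {MAX_DATE} " + " ".join(fixed_joins)
).fetchall()
print(counts, result)
```

With this sample data the count stays at 1 until the last join and then drops to 0, which would point at the quota join condition. Whether that is the actual cause depends on what the real quotatype column contains.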