MySQL bringing back a result it should not - mysql

I have a table filled with tasting notes written by users, and another table that holds ratings that other users give to each tasting note.
The query that brings up all notes that are written by other users that you have not yet rated looks like this:
SELECT tastingNotes.userID, tastingNotes.beerID, tastingNotes.noteID, tastingNotes.note, COALESCE(sum(tasteNoteRate.Score), 0) as count,
CASE
WHEN tasteNoteRate.userVoting = 1162 THEN 1
ELSE 0
END AS userScored
FROM tastingNotes
left join tasteNoteRate on tastingNotes.noteID = tasteNoteRate.noteID
WHERE tastingNotes.userID != 1162
Group BY tastingNotes.noteID
HAVING userScored < 1
ORDER BY count, userScored
User 1162 has already rated note 113. In the tasteNoteRate table it shows up as:
noteID | userVoting | score
113 | 1162 | 0
but note 113 is still returned each time the above query is run.

MySQL allows you to use group by in a rather special way without complaining, see the documentation:
If ONLY_FULL_GROUP_BY is disabled, a MySQL extension to the standard SQL use of GROUP BY permits the select list, HAVING condition, or ORDER BY list to refer to nonaggregated columns even if the columns are not functionally dependent on GROUP BY columns. [...] In this case, the server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate, which is probably not what you want.
This behaviour was the default behaviour prior to MySQL 5.7.
In your case that means: if there is more than one row in tasteNoteRate for a specific noteID (that is, if anyone else has already voted for that note), then userScored, which uses tasteNoteRate.userVoting without an aggregate function, will be based on an arbitrary row of the group, likely the wrong one.
You can fix that by using an aggregate:
select ...,
max(CASE
WHEN tasteNoteRate.userVoting = 1162 THEN 1
ELSE 0
END) AS userScored
from ...
or, because the result of a comparison (to something other than null) is either 1 or 0, you can also use a shorter version:
select ...,
coalesce(max(tasteNoteRate.userVoting = 1162),0) AS userScored
from ...
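A minimal sketch of the aggregate fix, run against SQLite through Python's sqlite3 module as a stand-in for MySQL. The sample rows, the reduced column list and the helper name unrated_note_ids are assumptions made for illustration; the aggregate is repeated in HAVING rather than referenced by alias to keep the query portable.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tastingNotes (noteID INTEGER PRIMARY KEY, userID INTEGER, note TEXT);
    CREATE TABLE tasteNoteRate (noteID INTEGER, userVoting INTEGER, score INTEGER);
    INSERT INTO tastingNotes VALUES
        (113, 7, 'hoppy'), (114, 8, 'malty'), (115, 1162, 'own note'), (116, 7, 'unrated');
    INSERT INTO tasteNoteRate VALUES
        (113, 9, 1), (113, 1162, 0), (114, 9, 1);
""")


def unrated_note_ids(conn, user_id):
    """Notes written by other users that user_id has not rated, lowest total first."""
    sql = """
        SELECT n.noteID,
               COALESCE(SUM(r.score), 0) AS total
        FROM tastingNotes n
        LEFT JOIN tasteNoteRate r ON n.noteID = r.noteID
        WHERE n.userID != :uid
        GROUP BY n.noteID
        -- aggregate over ALL ratings of the note, not one arbitrary row
        HAVING MAX(CASE WHEN r.userVoting = :uid THEN 1 ELSE 0 END) = 0
        ORDER BY total, n.noteID
    """
    return [row[0] for row in conn.execute(sql, {"uid": user_id})]


print(unrated_note_ids(conn, 1162))
```

With this data, note 113 has two ratings, one of them by user 1162, so it drops out no matter which row the server would have picked for a non-aggregated column.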
To be prepared for an upgrade to MySQL 5.7 (where ONLY_FULL_GROUP_BY is enabled by default), you should also already group by all non-aggregated columns in your select list: group by tastingNotes.userID, tastingNotes.beerID, tastingNotes.noteID, tastingNotes.note.
A different way of writing your query (amongst others) would be to do the grouping of tasteNoteRate in a subquery, so you don't have to group by all the columns of tastingNotes:
select tastingNotes.*,
coalesce(rates.count, 0) as count,
coalesce(rates.userScored, 0) as userScored
from tastingNotes
left join (
select tasteNoteRate.noteID,
sum(tasteNoteRate.Score) as count,
max(tasteNoteRate.userVoting = 1162) as userScored
from tasteNoteRate
group by tasteNoteRate.noteID
) rates
on tastingNotes.noteID = rates.noteID
where tastingNotes.userID != 1162
and coalesce(rates.userScored, 0) = 0
order by count;
The userScored filter belongs in the where-clause: with a left join, a condition in the on-clause only decides which rates rows get attached and never removes a note from the result. This form also allows you to get the notes the user voted on by changing the condition to rates.userScored = 1 (or remove it to get both).
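A small sketch of how the placement of the userScored condition interacts with a LEFT JOIN, again using SQLite via Python's sqlite3 as a stand-in for MySQL. The two-note data set and the names RATES, in_on and in_where are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tastingNotes (noteID INTEGER PRIMARY KEY, userID INTEGER);
    CREATE TABLE tasteNoteRate (noteID INTEGER, userVoting INTEGER, score INTEGER);
    INSERT INTO tastingNotes VALUES (113, 7), (114, 8);
    INSERT INTO tasteNoteRate VALUES (113, 9, 1), (113, 1162, 0), (114, 9, 1);
""")

# Per-note totals plus a flag telling whether user 1162 has voted on the note.
RATES = """
    SELECT noteID, SUM(score) AS total, MAX(userVoting = 1162) AS userScored
    FROM tasteNoteRate
    GROUP BY noteID
"""

# Condition in ON: note 113 is still returned, only without its rates row.
in_on = conn.execute(f"""
    SELECT n.noteID
    FROM tastingNotes n
    LEFT JOIN ({RATES}) rates
      ON n.noteID = rates.noteID AND rates.userScored = 0
    ORDER BY n.noteID
""").fetchall()

# Condition in WHERE: note 113 is removed from the result.
in_where = conn.execute(f"""
    SELECT n.noteID
    FROM tastingNotes n
    LEFT JOIN ({RATES}) rates ON n.noteID = rates.noteID
    WHERE COALESCE(rates.userScored, 0) = 0
    ORDER BY n.noteID
""").fetchall()

print(in_on, in_where)
```

The COALESCE in the WHERE clause keeps notes that have no ratings at all, since their rates columns are NULL after the left join.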

Change to an inner join.
The tasteNoteRate table is being left joined to tastingNotes, which means that every tastingNotes row matching the where clause is returned and then extended with the matching fields from tasteNoteRate. If there is no matching tasteNoteRate row, the tastingNotes row is still returned. An inner join takes the intersection instead.
See here for more explanation of the types of joins:
What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?
Make sure to create an index on noteID in both tables, or this query will slow down quickly as the tables grow.
Note: Based on what you've written as the use case, I'm still not 100% certain that you want to join on noteID. As it is, the query joins all the notes with all the ratings from all users ever. I think the CASE...END is just going to interfere with the query optimizer and turn it into a full scan + join. Why not just add another clause to the where: "and tasteNoteRate.userVoting = 1162"?
If these tables are not 1-1, as it looks like (given the sum() and "group by"), then the current query has a growth problem. If there are 10 notes and every note can have 10 different ratings, there are 100 candidate result rows. If that grows to 1000 and 1000, you will run out of memory fast. Eliminating a handful of rows based on the user's votes removes maybe 10 rows out of eventually 1,000,000+, all of which still have to be summed and grouped.
The other way you can do it is to reverse the left join:
select ..., sum(...) ... from tasteNoteRate ... left join tastingNotes using (noteID) where userID != xxx group by noteID
That way you only get tastingNotes information for other users' notes.
Maybe that helps, maybe not, but yeah, SCHEMA and specific use cases/example data would be helpful.
With this kind of "ratings of ratings", sometimes it's better to maintain a summary table of the vote totals and just track which notes the user has already voted on. That is, don't sum them all up in the select query; instead, update the total in the insert ... on duplicate key update (total = total + 1). At least that's how I handle the problem in some user ranking tables. They just grow so big so fast.
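A rough sketch of the summary-table idea, using SQLite through Python's sqlite3. SQLite has no ON DUPLICATE KEY UPDATE, so its ON CONFLICT ... DO UPDATE upsert (available from SQLite 3.24) is swapped in here. The noteTotals table, the composite primary key and the record_vote helper are assumptions, and the total is increased by the score rather than by a fixed 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- one row per (note, voter) records who has already voted
    CREATE TABLE tasteNoteRate (noteID INTEGER, userVoting INTEGER, score INTEGER,
                                PRIMARY KEY (noteID, userVoting));
    -- hypothetical summary table holding the running total per note
    CREATE TABLE noteTotals (noteID INTEGER PRIMARY KEY, total INTEGER NOT NULL);
""")


def record_vote(conn, note_id, user_id, score):
    """Store the vote and bump the running total in one transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute("INSERT INTO tasteNoteRate VALUES (?, ?, ?)",
                     (note_id, user_id, score))
        conn.execute("""
            INSERT INTO noteTotals (noteID, total) VALUES (?, ?)
            ON CONFLICT(noteID) DO UPDATE SET total = total + excluded.total
        """, (note_id, score))


record_vote(conn, 113, 9, 1)
record_vote(conn, 113, 10, 1)
record_vote(conn, 114, 9, 1)
totals = dict(conn.execute("SELECT noteID, total FROM noteTotals"))
print(totals)
```

Reading the totals is then a primary-key lookup instead of a SUM over every rating, at the cost of keeping the two tables in step on every write.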

Related

Where clause doesn't limit records when aggregate used in select

I have this select in my MySQL DB:
select r.ID, r.ReservationDate, SUM(p.Amount) AS Amount
from Reservations r
join Payments p
on r.ID = p.ReservationID
where r.ConfirmationNumber = '123456'
and p.CCLast4 = '3506'
and r.ID = 54321
It gives me exactly 1 record -- the correct record -- as expected. But if I change the CCLast4 (3506) to any old number/string I want, I still get the record back, but Amount is null. I would expect no record at all because the where clause no longer matches. If I change the ConfirmationNumber or the ID, as expected I get back no results. But CCLast4 is being completely ignored.
If I remove the aggregate (SUM(p.Amount) AS Amount), all is good, and the CCLast4 condition demands the correct number before returning the record.
I don't understand why the aggregate causes the where clause related to the Payments table (CCLast4 column) to be ignored.
How can I change the query so that I can use the aggregate in the select AND all the where clauses are honored?
This is actually the expected behaviour. From the manual:
Without GROUP BY, there is a single group and it is nondeterministic which [non-aggregated column] value to choose for the group.
Although not very clearly stated, this means that you always get one group (so one row), even for an empty table.
It is also worth emphasizing that the values that MySQL chooses for the non-aggregated columns in your select, r.ID and r.ReservationDate, are in fact nondeterministic and will specifically vary across MySQL versions (e.g. they will usually be null for MySQL 8.0 while they will usually contain existing values for earlier versions).
The solution is similarly subtle - add a group by (so the quoted sentence does not apply anymore):
...
where r.ConfirmationNumber = '123456'
and p.CCLast4 = 'xxx'
and r.ID = 54321
group by r.ID, r.ReservationDate
should give you 0 rows.
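A quick sketch of the one-group behaviour, using SQLite via Python's sqlite3 as a stand-in (SQLite also returns exactly one row for an aggregate query without GROUP BY, with NULL for the non-aggregated column when nothing matches). The sample rows, the reduced select list and the totals helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Reservations (ID INTEGER PRIMARY KEY, ReservationDate TEXT,
                               ConfirmationNumber TEXT);
    CREATE TABLE Payments (ReservationID INTEGER, CCLast4 TEXT, Amount REAL);
    INSERT INTO Reservations VALUES (54321, '2020-01-01', '123456');
    INSERT INTO Payments VALUES (54321, '3506', 100.0), (54321, '3506', 50.0);
""")

BASE = """
    SELECT r.ID, SUM(p.Amount) AS Amount
    FROM Reservations r
    JOIN Payments p ON r.ID = p.ReservationID
    WHERE r.ConfirmationNumber = '123456' AND p.CCLast4 = ? AND r.ID = 54321
"""


def totals(cc_last4, grouped):
    """Run the query with or without GROUP BY for a given card suffix."""
    sql = BASE + (" GROUP BY r.ID" if grouped else "")
    return conn.execute(sql, (cc_last4,)).fetchall()


print(totals("0000", grouped=False))  # one all-NULL row: the single empty group
print(totals("0000", grouped=True))   # no rows: no group was formed
print(totals("3506", grouped=True))   # the matching reservation with its sum
```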

MySQL Selecting things where a condition on a row is met 2 or more times, but showing the two or more results

If I use GROUP BY then I will get just 1 row per group. For example
Sessions table: SessionId (other things)
Actions table: ActionId, SessionId, (other things)
With:
SELECT S.*, A.* FROM ActionList A JOIN SessionList S ON A.SessionId = S.SessionId
WHERE 1 /*various criteria to filter*/
ORDER BY S.SessionId DESC, ActionId DESC;
Thus showing me the most recent session at the top. Now I want to look at only sessions with 2 or more actions.
If I use GROUP BY A.SessionId then I can get COUNT(ActionId) and use HAVING to look at rows only with the required count, but I won't get both (or more) rows, just the one.
I suspect I can do this by JOINing a table with SessionIds and the count of action IDs, but I'm fairly new to joins (I could do this via a subquery and ANY).
If a view would help, I would create a view of the form:
SELECT SessionId, COUNT(*) FROM Actions GROUP BY SessionId;
Or put this in brackets and JOIN on it (but I confess I'd have to look up 3 table joins)
What is the neatest way to do this?
Also is this where "Foreign keys" come into play? That'd probably stop the "ambiguity errors" I get if I don't qualify SessionId. I've avoided them for fear of TRIGGERs, I also didn't know about JOINs and just used subqueries until recently. I've realised it is stupid to avoid things that were added to help.
Additionally I'm quite timid with joins because I know what they do, well, in the worst case. If I JOIN a table with m rows to another with n, I end up with m*n rows. That could be VERY large! I'm dealing with large tables (as in: schema won't fit in RAM large) so that is quite scary. I do know MySQL optimises well (able to move stuff from HAVING to WHERE and so forth) but still!
If you want to look at sessions with two or more actions, then use a join:
select sl.*
from SessionList sl join
(select SessionId, count(*) as cnt
from Actions
group by SessionId
) a
on sl.SessionId = a.SessionId and cnt > 1;
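If every action row of those sessions is wanted as well (as the question asks), the same counting subquery can be joined back to the action table. This is a sketch under assumptions: SQLite via Python's sqlite3 stands in for MySQL, the tables are reduced to their id columns, and the sample rows and the alias busy are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SessionList (SessionId INTEGER PRIMARY KEY);
    CREATE TABLE ActionList (ActionId INTEGER PRIMARY KEY, SessionId INTEGER);
    INSERT INTO SessionList VALUES (1), (2), (3);
    INSERT INTO ActionList VALUES (10, 1), (11, 1), (12, 2), (13, 3), (14, 3), (15, 3);
""")

# The derived table keeps only sessions with two or more actions; joining it
# filters the session/action pairs without collapsing them to one row each.
rows = conn.execute("""
    SELECT S.SessionId, A.ActionId
    FROM ActionList A
    JOIN SessionList S ON A.SessionId = S.SessionId
    JOIN (SELECT SessionId, COUNT(*) AS cnt
          FROM ActionList
          GROUP BY SessionId
          HAVING COUNT(*) > 1) busy ON busy.SessionId = S.SessionId
    ORDER BY S.SessionId DESC, A.ActionId DESC
""").fetchall()

print(rows)
```

Session 2 has a single action and disappears; sessions 1 and 3 keep all of their action rows. Because every join is on SessionId, the result never exceeds the number of action rows, so the m*n worst case does not arise.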

mysql displaying grouped columns base on condition

I am working on a query that needs to output 'total engagements' by users in columns: a 1eng column showing the number of users who have one engagement, a 2eng column showing the number of users who have two engagements, likewise 3eng, and so on. Note that the display should be like this. I have an engagements table which has userID. So I get distinct users like this
select count(distinct userID) from engagements
and I get engagements as
select count(*) from engagements
Engagements here refers to users who have either liked, replied to, or shared the content
Please help. Thanks! I have used CASE and IF but have been unable to display the result in the form below
1eng 2eng 3eng
100 200 100
Consider returning the results in rows and pivoting them afterwards in your application.
To return the desired results in rows, you could use the following query:
SELECT
engagementCount,
COUNT(*) AS userCount
FROM (
SELECT
userID,
COUNT(*) AS engagementCount
FROM engagements
GROUP BY userID
) AS s
GROUP BY engagementCount
;
Basically, you first group the engagements rows by userID and get the row counts per userID. Afterwards, you use the counts as the grouping criterion and count how many users were found with that count.
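A minimal sketch of the "rows first, pivot in the application" approach, using SQLite via Python's sqlite3 as a stand-in for MySQL. The sample engagements, the kind column and the pivot dictionary are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE engagements (userID INTEGER, kind TEXT)")
conn.executemany("INSERT INTO engagements VALUES (?, ?)", [
    (1, "like"),
    (2, "like"), (2, "share"),
    (3, "reply"),
    (4, "like"), (4, "like"), (4, "share"),
])

# Inner query: engagements per user. Outer query: users per engagement count.
rows = conn.execute("""
    SELECT engagementCount, COUNT(*) AS userCount
    FROM (SELECT userID, COUNT(*) AS engagementCount
          FROM engagements
          GROUP BY userID) AS s
    GROUP BY engagementCount
    ORDER BY engagementCount
""").fetchall()

# Pivot in the application: one named column per engagement count.
pivot = {f"{count}eng": users for count, users in rows}
print(rows, pivot)
```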
If you insist on returning the columnar view in SQL, you'll need to resort to dynamic SQL because of the indefinite number of columns in the final result set. You'd probably need to store the results of the inner SELECT temporarily, scan it to build the list of count expressions for every engagementCount value and ultimately construct a query of this kind:
SELECT
COUNT(engagementCount = 1 OR NULL) AS `1eng`,
COUNT(engagementCount = 2 OR NULL) AS `2eng`,
COUNT(engagementCount = 3 OR NULL) AS `3eng`,
...
FROM temporary_storage
;
Or SUM(engagementCount = value) instead of COUNT(engagementCount = value OR NULL). (For me, the latter expresses the intention more explicitly, hence why I've suggested it first, but, in case you happen to prefer the SUM technique, there should be no discernible difference in performance between the two. The OR NULL trick is explained here.)
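A sketch of how the column list could be generated, with Python building the statement instead of server-side dynamic SQL. SQLite via sqlite3 stands in for MySQL, the contents of temporary_storage are made up, and double-quoted aliases replace the backticks. The interpolated values are cast to int first, since they end up inside the SQL text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE temporary_storage (userID INTEGER, engagementCount INTEGER)")
conn.executemany("INSERT INTO temporary_storage VALUES (?, ?)",
                 [(1, 1), (2, 2), (3, 1), (4, 3)])

# Step 1: find which engagement counts occur at all.
counts = [int(r[0]) for r in conn.execute(
    "SELECT DISTINCT engagementCount FROM temporary_storage ORDER BY 1")]

# Step 2: one conditional COUNT per value; "x = c OR NULL" is NULL (not counted)
# whenever the comparison is false.
columns = ", ".join(
    f'COUNT(engagementCount = {c} OR NULL) AS "{c}eng"' for c in counts)

# Step 3: run the generated single-row pivot query.
cur = conn.execute(f"SELECT {columns} FROM temporary_storage")
header = [d[0] for d in cur.description]
values = list(cur.fetchone())
print(header, values)
```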

How do I join one table onto another where userid = userid but only for that date?

I'm looking to take the total time a user worked on each batch at his workstation, the total estimated work that was completed, the amount the user was paid, and how many failures the user has had for each day this year. If I can join all of this into one query then I can use it in excel and format things nicely in pivot tables and such.
EDIT: I've realized that it is only possible to do this in multiple queries, so I have narrowed my scope down to this:
SELECT batch_log.userid,
batches.operation_id,
SUM(TIME_TO_SEC(ramses.batch_log.time_elapsed)),
SUM(ramses.tasks.estimated_nonrecurring + ramses.tasks.estimated_recurring),
DATE(start_time)
FROM batch_log
JOIN batches ON batch_log.batch_id=batches.id
JOIN ramses.tasks ON ramses.batch_log.batch_id=ramses.tasks.batch_id
JOIN protocase.tblusers on ramses.batch_log.userid = protocase.tblusers.userid
WHERE DATE(ramses.batch_log.start_time) > "2011-01-01"
AND protocase.tblusers.active = 1
GROUP BY userid, batches.operation_id, start_time
ORDER BY start_time, userid ASC
The cross join was causing the problem.
No, in general a Having clause is used to filter the results of your Group by - for example, only reporting those who were paid for more than 24 hours in a day (HAVING SUM(ramses.timesheet_detail.paidTime) > 24). Unless you need to perform filtering of aggregate results, you shouldn't need a having clause at all.
Most of those conditions should be moved into a where clause, or as part of the joins, for two reasons - 1) Filtering should in general be done as soon as possible, to limit the work the query needs to perform. 2) If the filtering is already done, restating it may cause the query to perform additional, unneeded work.
From what I've seen so far, it appears that you're trying to roll things up by the day. Try changing the last column in the group by clause to date(ramses.batch_log.start_time); otherwise you're grouping by (what I assume is) a timestamp.
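A small sketch of the difference between grouping by the raw timestamp and grouping by its date, using SQLite via Python's sqlite3 as a stand-in for MySQL. The table is reduced to three columns, time_elapsed is assumed to be stored in seconds already (SQLite has no TIME_TO_SEC), and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE batch_log (userid INTEGER, start_time TEXT, time_elapsed INTEGER)")
conn.executemany("INSERT INTO batch_log VALUES (?, ?, ?)", [
    (1, "2011-02-01 08:00:00", 600),
    (1, "2011-02-01 13:30:00", 300),
    (1, "2011-02-02 09:00:00", 120),
])

# Grouping by the timestamp: every distinct start time is its own group.
by_timestamp = conn.execute("""
    SELECT userid, start_time, SUM(time_elapsed)
    FROM batch_log
    GROUP BY userid, start_time
    ORDER BY start_time
""").fetchall()

# Grouping by the date part: one row per user per day.
by_day = conn.execute("""
    SELECT userid, DATE(start_time) AS day, SUM(time_elapsed)
    FROM batch_log
    GROUP BY userid, DATE(start_time)
    ORDER BY day
""").fetchall()

print(len(by_timestamp), by_day)
```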
EDIT:
About schema names - yes, you can name them in the from and join sections. Often, too, the query may be able to resolve the needed schemas based on some default search list (how or if this is set up depends on your database).
Here is how I would have reformatted the query:
SELECT tblusers.userid, operations.name AS name,
SUM(TIME_TO_SEC(batch_log.time_elapsed)) AS time_elapsed,
SUM(tasks.estimated_nonrecurring + tasks.estimated_recurring) AS total_estimated,
SUM(timesheet_detail.paidTime) as hours_paid,
DATE(start_time) as date_paid
FROM tblusers
JOIN batch_log
ON tblusers.userid = batch_log.userid
AND DATE(batch_log.start_time) >= "2011-01-01"
JOIN batches
ON batch_log.batch_id = batches.id
JOIN operations
ON operations.id = batches.operation_id
JOIN tasks
ON batches.id = tasks.batch_id
JOIN timesheet_detail
ON tblusers.userid = timesheet_detail.userid
AND batch_log.start_time = timesheet_detail.for_day
AND DATE(timesheet_detail.for_day) = DATE(start_time)
WHERE tblusers.departmentid = 8
GROUP BY tblusers.userid, name, DATE(batch_log.start_time)
ORDER BY date_paid ASC
Of particular concern is the batch_log.start_time = timesheet_detail.for_day line, which is comparing (what are implied to be) timestamps. Are these really equal? I expect that one or both of these should be wrapped in a date() function.
As for why you may be getting unexpected data - you appear to have eliminated some of your join conditions. Without knowing the exact setup and use of your database, I cannot give the exact reason for your results (or even say whether they are wrong), but I think the fact that you join to the operations table without any join condition is probably to blame - if there are 2 records in that table, it will double all of your previous results, and it looks like there may be 12. You also removed operations.name from the group by clause, which may or may not give you the results you want. I would look into the rest of your table relationships, and see if there are any further restrictions that need to be made.
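The doubling effect of a join without a condition can be reproduced on a toy schema. This sketch uses SQLite via Python's sqlite3 as a stand-in for MySQL; the tables are cut down to the columns needed, and the seconds column and all rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE batch_log (batch_id INTEGER, seconds INTEGER);
    CREATE TABLE batches (id INTEGER PRIMARY KEY, operation_id INTEGER);
    CREATE TABLE operations (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO batches VALUES (1, 10);
    INSERT INTO operations VALUES (10, 'cut'), (20, 'bend');
    INSERT INTO batch_log VALUES (1, 600), (1, 300);
""")

# No join condition: each log row is paired with BOTH operations rows.
unconditioned = conn.execute("""
    SELECT SUM(bl.seconds)
    FROM batch_log bl
    JOIN batches b ON bl.batch_id = b.id
    JOIN operations o
""").fetchone()[0]

# With the condition, each log row meets exactly one operations row.
conditioned = conn.execute("""
    SELECT SUM(bl.seconds)
    FROM batch_log bl
    JOIN batches b ON bl.batch_id = b.id
    JOIN operations o ON o.id = b.operation_id
""").fetchone()[0]

print(unconditioned, conditioned)
```

The inflated sum is exactly the true sum multiplied by the number of rows in operations, which is a useful symptom to look for.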

Order a query with two keys SQL Server 2008

I am trying to order a query by two keys. The query is built with several subqueries. The table contains, besides columns with other data, two columns, Key and Key_Father. I need to order the results in SQL so that they can be printed in a report. This is an example:
Key Key_Father
4 NULL
1 4
2 4
7 NULL
1 7
2 7
As you can see, it is a father-son structure, where a row is a father if Key_Father is NULL, and the Key column starts from one for the sons of each father.
The first subquery returns the data in order, because it is stored in that order in the table, but the second subquery, which uses a group by, does not. So I tried adding an extra column with Row_Number to the first subquery to keep that order, but the second subquery still does the same thing.
This is the query:
SELECT Orden,INV_Key,Key_Padre,INV.INV_ID,INV.BOD_Bodega_ID,
CASE WHEN MAX(HIS_Ventas) > 0 OR max(HIS_Disponible) > 0 THEN 1 ELSE 0 END AS Participacion,MAX(ISNULL(HIS_Ventas,0)) AS Ventas
FROM(SELECT ROW_NUMBER() OVER (ORDER BY C.INV_Compra_ID) Orden,C.BOD_Bodega_ID,INV_Key,Key_Padre,CD.INV_ID
FROM dbo.INV_COMPRAS_USADOS C
INNER JOIN dbo.INV_COMPRAS_USADOS_DET CD ON C.INV_Compra_ID = CD.INV_Compra_ID
WHERE C.INV_Compra_ID = @Compra_ID
AND ((Key_Padre IS NULL AND CD.INV_Catalogo_Codigo = ISNULL(@Cod_Catalogo,CD.INV_Catalogo_Codigo)
AND INV_Key IN (SELECT DISTINCT Key_Padre
FROM dbo.INV_COMPRAS_USADOS_DET
WHERE INV_Compra_ID = @Compra_ID AND Key_Padre IS NOT NULL))
OR Key_Padre IN (SELECT DISTINCT INV_Key
FROM dbo.INV_COMPRAS_USADOS_DET
WHERE INV_Compra_ID = @Compra_ID AND (Key_Padre IS NULL AND CD.INV_Catalogo_Codigo = ISNULL(@Cod_Catalogo,CD.INV_Catalogo_Codigo))))) INV
LEFT JOIN DBO.HIS_HISTORICO_DETALLE HD ON INV.INV_ID = HD.INV_ID AND HD.BOD_Bodega_ID = INV.BOD_Bodega_ID
LEFT JOIN DBO.HIS_HISTORICO_INVENTARIO H on H.HIS_Historico_ID= HD.HIS_Historico_ID AND (CONVERT(datetime,(convert(varchar(20),HIS_Historico_Ano) + '/' + convert(varchar(20),HIS_Historico_Mes) + '/01')) BETWEEN @FechaDesde AND @FechaHasta)
WHERE H.HIS_Historico_Mes IS NOT NULL OR INV.INV_ID IS NULL
GROUP BY Orden,INV_Key,Key_Padre,INV.INV_ID,INV.BOD_Bodega_ID,HIS_Historico_Ano,HIS_Historico_Mes
Another interesting thing (well, for me) is that when I replace the @variables with constant values, the second query keeps the correct order, even though the constant values are the same as the values of the @variables. This is just a portion of the total query; it is a subquery that needs another two selects, and I need to keep the order from those selects too.
So I hope that someone could help me with this. Thanks!
To order the results you need to place an ORDER BY clause on the outermost SELECT statement. Using ORDER BY in a nested SELECT is generally not permitted but even if you work around it (e.g. by using TOP), you can't rely on the results being ordered in any particular way.
Without an ORDER BY the results may appear to be coming out in the order you want but this cannot be relied upon. Running the same query on a different server or at some point in the future may produce a different order where differences in statistics, server load, etc can affect how the query optimizer actually executes the statement.
The portion of the query you've provided is outputting the following columns. Which are the ones you want to order by?
Orden (although this is just an alias for INV_Compra_ID as far as ordering is concerned)
INV_Key
Key_Padre
INV_ID
BOD_Bodega_ID
Participacion
Ventas
Let's say you want to order by just three of them; then you need to append the following clause to the outermost SELECT:
ORDER BY
Orden,
INV_Key,
Key_Padre
This should do it. I'm not sure if I'm missing an obvious simplification though.
ORDER BY ISNULL(Key_Father,[Key]), ISNULL(Key_Father,-1),[Key]
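A quick check of that ORDER BY on the example rows from the question, using SQLite via Python's sqlite3 as a stand-in for SQL Server. COALESCE replaces ISNULL, the table name items is made up, and the rows are inserted out of order on purpose. The -1 sentinel assumes all keys are non-negative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE items ("Key" INTEGER, Key_Father INTEGER)')
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(2, 7), (1, 4), (7, None), (2, 4), (4, None), (1, 7)])

# 1st sort key: the family (a father's own Key, or a son's Key_Father).
# 2nd sort key: -1 for the father, so it precedes its sons.
# 3rd sort key: the son's Key within the family.
ordered = conn.execute("""
    SELECT "Key", Key_Father
    FROM items
    ORDER BY COALESCE(Key_Father, "Key"), COALESCE(Key_Father, -1), "Key"
""").fetchall()

for row in ordered:
    print(row)
```

Because the ORDER BY sits on the outermost SELECT, the order no longer depends on how the rows happen to be stored or on the plan the optimizer chooses.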