MySQL - Sum by other column value

Here's the problem. I have a long but not very complex query:
SELECT SUM(x.value)
FROM valuetable AS x
LEFT JOIN jointable_1 AS y
LEFT JOIN jointable_2 AS z
etc
...
GROUP BY y.id, z.id
There are n left joins, and I need to keep it this way, because it must be possible to add a new left join at any time. I obviously get duplicated values in the SUM, since the joined tables can return multiple rows, and I cannot move any of them into a subquery because the WHERE clause has to stay flexible. I need only one x.value per x.id in the SUM; that's also obvious.
-I cannot add x.id to GROUP BY, since I need one row with the sum per y.id.
-I cannot use the calculation:
SUM(x.value)*COUNT(DISTINCT x.id)/COUNT(*)
since there can be any number of x.values in sum, as different x.id-s have different amount of joins.
-I cannot go for DISTINCT x.value, since any x.id can have any x.value and they can contain same value.
-I don't know how to create a subquery for sum, since I cannot use the aggregated value (for example GROUP_CONCAT(DISTINCT x.id)) in subquery, or can I?
Anyway, that's it. I know I can rearrange the query (subqueries instead of joins, a different FROM), but I want to leave that as the last resort. Is there a way to achieve what I want?

Sorry to say, there's no general way to do what you want without subqueries (or maybe views).
A bit of jargon: "Cardinality". For our purpose it's the number of rows in a table or a result set. (For our purpose a result set is a kind of virtual table.)
For aggregate functions like SUM(col) and COUNT(*) to give good results, we must attend to the cardinality of the table being summarized. This kind of thing
SELECT DATE(sale_time) sale_date,
store_id,
SUM(sale_amount) total_sales
FROM sale
GROUP BY DATE(sale_time), store_id
summarizes the same cardinality of result table as the underlying table, so it generates useful results.
But, if we do this
SELECT DATE(sale.sale_time) sale_date,
sale.store_id,
SUM(sale.sale_amount) total_sales,
COUNT(promo.promo_id) promos
FROM sale
LEFT JOIN promo ON sale.store_id = promo.store_id
AND DATE(sale.sale_time) = promo.promo_date
GROUP BY DATE(sale.sale_time), sale.store_id
we wreck the cardinality of the summarized result set. This will never work unless we know for sure that each store had either zero or one promo records for each given day. Why not? The LEFT JOIN operation affects the cardinality of the virtual table being summarized. That means some sale_amount values may show up in the SUM more than once, and therefore the SUM won't be correct, or trustworthy.
How can you prevent LEFT JOIN operations from messing up your cardinality? Make sure your LEFT JOIN's ON clause matches each row on the left to exactly zero rows, or exactly one row, on the right. That is, make sure your (virtual) tables on either side of the JOIN have appropriate cardinality.
(In entity-relationship jargon, your SUM fails because you join two entities with a one-to-many relationship before you do the sum.)
The theoretically cleanest way to do it is to perform both aggregate operations before the join. This joins two virtual tables in a way that the LEFT JOIN is either one-to-none or one-to-one
SELECT sales.sale_date,
sales.store_id,
sales.total_sales,
promos.promo_count
FROM (
SELECT DATE(sale_time) sale_date,
store_id,
SUM(sale_amount) total_sales
FROM sale
GROUP BY DATE(sale_time), store_id
) sales
LEFT JOIN (
SELECT store_id,
promo_date,
COUNT(*) promo_count
FROM promo
GROUP BY store_id, promo_date
) promos ON sales.store_id = promos.store_id
AND sales.sale_date = promos.promo_date
Although this SQL is complex, most servers handle this kind of pattern efficiently.
Troubleshooting tip: If you see SUM() ... FROM ... JOIN ... GROUP BY all at the same level of a query, you may have cardinality problems.
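To see the effect concretely, here is a minimal sketch in Python that uses the standard-library sqlite3 module as a stand-in for MySQL. That substitution, the sample rows, and the plain sale_date column (used instead of DATE(sale_time)) are assumptions made only to keep the example self-contained. It runs the join-then-sum query and the aggregate-before-join query against the same data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sale  (sale_id INTEGER PRIMARY KEY, store_id INTEGER,
                        sale_date TEXT, sale_amount INTEGER);
    CREATE TABLE promo (promo_id INTEGER PRIMARY KEY, store_id INTEGER,
                        promo_date TEXT);
    -- one store, one day: two sales (100 + 50) and two promos
    INSERT INTO sale  VALUES (1, 1, '2024-01-01', 100), (2, 1, '2024-01-01', 50);
    INSERT INTO promo VALUES (1, 1, '2024-01-01'), (2, 1, '2024-01-01');
""")

# Join first, then aggregate: each sale row is repeated once per matching promo row.
joined_total = conn.execute("""
    SELECT SUM(sale.sale_amount)
    FROM sale
    LEFT JOIN promo ON sale.store_id = promo.store_id
                   AND sale.sale_date = promo.promo_date
    GROUP BY sale.sale_date, sale.store_id
""").fetchone()[0]

# Aggregate each table first, then join two results that have one row per key.
pre_total, promo_count = conn.execute("""
    SELECT sales.total_sales, promos.promo_count
    FROM (SELECT sale_date, store_id, SUM(sale_amount) AS total_sales
          FROM sale
          GROUP BY sale_date, store_id) AS sales
    LEFT JOIN (SELECT store_id, promo_date, COUNT(*) AS promo_count
               FROM promo
               GROUP BY store_id, promo_date) AS promos
           ON sales.store_id = promos.store_id
          AND sales.sale_date = promos.promo_date
""").fetchone()

print(joined_total, pre_total, promo_count)
```

With these rows the joined query sums every sale twice, while the pre-aggregated version reports the real total alongside the promo count.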

Related

mySQL bringing back result it should not

I have a table filled with tasting notes written by users, and another table that holds ratings that other users give to each tasting note.
The query that brings up all notes that are written by other users that you have not yet rated looks like this:
SELECT tastingNotes.userID, tastingNotes.beerID, tastingNotes.noteID, tastingNotes.note, COALESCE(sum(tasteNoteRate.Score), 0) as count,
CASE
WHEN tasteNoteRate.userVoting = 1162 THEN 1
ELSE 0
END AS userScored
FROM tastingNotes
left join tasteNoteRate on tastingNotes.noteID = tasteNoteRate.noteID
WHERE tastingNotes.userID != 1162
Group BY tastingNotes.noteID
HAVING userScored < 1
ORDER BY count, userScored
User 1162 has written a note for note 113. In the tasteNoteRate table it shows up as:
noteID | userVoting | score
113    | 1162       | 0
but it is still returned each time the above query is run....
MySQL allows you to use group by in a rather special way without complaining, see the documentation:
If ONLY_FULL_GROUP_BY is disabled, a MySQL extension to the standard SQL use of GROUP BY permits the select list, HAVING condition, or ORDER BY list to refer to nonaggregated columns even if the columns are not functionally dependent on GROUP BY columns. [...] In this case, the server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate, which is probably not what you want.
This behaviour was the default behaviour prior to MySQL 5.7.
In your case that means, if there is more than one row in tasteNoteRate for a specific noteID, so if anyone else has already voted for that note, userScored, which is using tasteNoteRate.userVoting without an aggregate function, will be based on a random row - likely the wrong one.
You can fix that by using an aggregate:
select ...,
max(CASE
WHEN tasteNoteRate.userVoting = 1162 THEN 1
ELSE 0
END) AS userScored
from ...
or, because the result of a comparison (to something other than null) is either 1 or 0, you can also use a shorter version:
select ...,
coalesce(max(tasteNoteRate.userVoting = 1162),0) AS userScored
from ...
To be prepared for an upgrade to MySQL 5.7 (and enabled ONLY_FULL_GROUP_BY), you should also already group by all non-aggregate columns in your select-list: group by tastingNotes.userID, tastingNotes.beerID, tastingNotes.noteID, tastingNotes.note.
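As a rough illustration of the aggregate fix, the following Python sketch uses the standard-library sqlite3 module in place of MySQL. The table contents, the reduced column list and the aliases n and r are assumptions for the example. The MAX(CASE ...) expression is repeated in HAVING rather than referenced by alias, which keeps the query portable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tastingNotes  (noteID INTEGER PRIMARY KEY, userID INTEGER, note TEXT);
    CREATE TABLE tasteNoteRate (noteID INTEGER, userVoting INTEGER, Score INTEGER);
    INSERT INTO tastingNotes VALUES (113, 7, 'hoppy'), (114, 8, 'malty'), (115, 9, 'sour');
    -- note 113: rated by users 5 and 1162; note 114: rated by user 5 only; note 115: unrated
    INSERT INTO tasteNoteRate VALUES (113, 5, 1), (113, 1162, 0), (114, 5, 1);
""")

# MAX(CASE ...) looks at every rating row in the group, so the flag is 1
# whenever user 1162 rated the note, whichever row the server visits first.
rows = conn.execute("""
    SELECT n.noteID,
           COALESCE(SUM(r.Score), 0) AS total,
           MAX(CASE WHEN r.userVoting = 1162 THEN 1 ELSE 0 END) AS userScored
    FROM tastingNotes AS n
    LEFT JOIN tasteNoteRate AS r ON n.noteID = r.noteID
    WHERE n.userID != 1162
    GROUP BY n.noteID
    HAVING MAX(CASE WHEN r.userVoting = 1162 THEN 1 ELSE 0 END) < 1
    ORDER BY n.noteID
""").fetchall()

print(rows)
```

Note 113 drops out because one of its rating rows belongs to user 1162; the unrated note 115 stays in with a total of 0.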
A different way of writing your query (amongst others) would be to do the grouping of tasteNoteRate in a subquery, so you don't have to group by all the columns of tastingNotes:
select tastingNotes.*,
coalesce(rates.count, 0) as count,
coalesce(rates.userScored,0) as userScored
from tastingNotes
left join (
select tasteNoteRate.noteID,
sum(tasteNoteRate.Score) as count,
max(tasteNoteRate.userVoting = 1162) as userScored
from tasteNoteRate
group by tasteNoteRate.noteID
) rates
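on tastingNotes.noteID = rates.noteID
where tastingNotes.userID != 1162
and coalesce(rates.userScored, 0) = 0
order by count;
The userScored test belongs in the where-clause: inside the on-clause of a left join it would only blank out the rates columns, not remove the note. This also allows you to get the notes the user voted on by changing coalesce(rates.userScored, 0) = 0 to = 1 (or remove the condition to get both).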
Change to an inner join.
The tasteNoteRate table is being left joined to the tastingNotes, which means that the full tastingNotes table (matching the where) is returned, and then expanded by the matching fields in the tasteNoteRate table. If tasteNoteRate is not satisfied, it doesn't prevent tastingNotes from returning the matched fields. The inner join will take the intersection.
See here for more explanation of the types of joins:
What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?
Make sure to create an index on noteID in both tables or this query and use case will quickly explode.
Note: Based on what you've written as the use case, I'm still not 100% certain that you want to join on noteID. As it is, it will try to give you a joined table on all the notes joined with all the ratings for all users ever. I think the CASE...END is just going to interfere with the query optimizer and turn it into a full scan + join. Why not just add another clause to the where..."and tasteNoteRate.userVoting = 1162"?
If these tables are not 1-1, as it looks like (given the sum() and "group by"), then you will be faced with an exploding problem with the current query. If every note can have 10 different ratings, and there are 10 notes, then there are 100 candidate result rows. If it grows to 1000 and 1000, you will run out of memory fast. Eliminating a few rows that the userID hasn't voted on will remove like what 10 rows from eventually 1,000,000+, and then sum and group them?
The other way you can do it is to reverse the left join:
select ...,sum()... from tasteNoteRate ... left join tastingNotes using (noteID) where userID != xxx group by noteID, that way you only get tastingNotes information for other users' notes.
Maybe that helps, maybe not, but yeah, SCHEMA and specific use cases/example data would be helpful.
With this kind of "ratings of ratings", sometimes it's better to maintain a summary table of the vote totals and just track which notes the user has already voted on. That is, don't sum them all up in the select query. Instead, update the total in the insert...on duplicate key update (total = total + 1). At least that's how I handle the problem in some user ranking tables. They just grow so big so fast.

wrapping inside aggregate function in SQL query

I have 2 tables called Orders and Salesperson shown below:
And I want to retrieve the names of all salespeople that have more than 1 order from the tables above.
Then firing following query shows an error:
SELECT Name
FROM Orders, Salesperson
WHERE Orders.salesperson_id = Salesperson.ID
GROUP BY salesperson_id
HAVING COUNT( salesperson_id ) >1
The error is:
Column 'Name' is invalid in the select list because it is
not contained in either an aggregate function or
the GROUP BY clause.
From the error and searching it on google, I could understand that the error is because of Name column must be either a part of the group by statement or aggregate function.
I also tried to understand why the selected column has to be in the GROUP BY clause or part of an aggregate function, but I didn't understand it clearly.
So, how to fix this error?
SELECT max(Name) as Name
FROM Orders, Salesperson
WHERE Orders.salesperson_id = Salesperson.ID
GROUP BY salesperson_id
HAVING COUNT( salesperson_id ) >1
The basic idea is that columns that are not in the GROUP BY clause need to be wrapped in an aggregate function. Here, because the name is the same for every row with a given salesperson_id, MIN or MAX makes no real difference (the result is the same).
Example:
Looking at your data, you have 3 entries for Dan (7). When the join is created, the row with Dan (Name) is repeated 3 times, once per order Number, and the server does not know which "Dan" to pick, because to the server those are 3 separate rows even though they are semantically the same.
also try this so that you see what I am talking about:
SELECT Orders.Number, Salesperson.Name
FROM Orders, Salesperson
WHERE Orders.salesperson_id = Salesperson.ID
As far as the query goes, INNER JOIN is a better solution, since it is the standard syntax. For this simple query it should not matter. In some cases an explicit INNER JOIN may produce a better plan, but as far as I know that is mostly a legacy concern, since these days the server should produce pretty much the same execution plan either way.
For code clarity I would stick with INNER JOIN
Assuming the name is unique to the salesperson.id then simply add it to your group by clause
GROUP BY salesperson_id, salesperson.Name
Otherwise use any Agg function
Select Min(Name)
The reason for this is that SQL doesn't know whether there are multiple names per salesperson.id.
For readability and correctness, I usually split aggregate queries into two parts:
The aggregate query
Any additional queries to support fields not contained in aggregate functions
So:
1.Aggregate query - salespeople with more than 1 order
SELECT salesperson_id
FROM ORDERS
GROUP BY salesperson_id
HAVING COUNT(Number) > 1
2.Use aggregate as subquery (basically a select joining onto another select) to join on any additional fields:
SELECT *
FROM Salesperson SP
INNER JOIN
(
SELECT salesperson_id
FROM ORDERS
GROUP BY salesperson_id
HAVING COUNT(Number) > 1
) AGG_QUERY
ON AGG_QUERY.salesperson_id = SP.ID
There are other approaches, such as selecting the additional fields via aggregation functions (as shown by the other answers). These get the code written quickly so if you are writing the query under time pressure you may prefer that approach. If the query needs to be maintained (and hence readable) I would favour subqueries.
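The two-step approach can be tried end to end with a small Python sketch. It uses the standard-library sqlite3 module rather than the asker's server, and the sample salespeople and orders are invented for illustration, since the original tables are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Salesperson (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Orders (Number INTEGER PRIMARY KEY, salesperson_id INTEGER);
    INSERT INTO Salesperson VALUES (1, 'Abe'), (7, 'Dan'), (8, 'Ken');
    -- Dan has three orders, Abe has two, Ken has one
    INSERT INTO Orders VALUES (10, 7), (20, 7), (30, 7), (40, 8), (50, 1), (60, 1);
""")

# The inner query does all the grouping and returns one row per qualifying
# salesperson_id, so the outer SELECT can read Name without any aggregate.
names = [row[0] for row in conn.execute("""
    SELECT SP.Name
    FROM Salesperson AS SP
    INNER JOIN (SELECT salesperson_id
                FROM Orders
                GROUP BY salesperson_id
                HAVING COUNT(Number) > 1) AS AGG_QUERY
            ON AGG_QUERY.salesperson_id = SP.ID
    ORDER BY SP.Name
""")]

print(names)
```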

Database specific selection of data

I have a database and one of tables has the following structure:
recordId, vehicleId, dateOfTireChange, expectedKmBeforeNextChange, tireType
I want to select from the table so that I only get those rows that contain the most recent date for each vehicleId.
I tried this approach
SELECT vehicleid,
Max(dateoftirechange) AS lastChange,
expectedkmbeforenextchange,
tiretype
FROM vehicle_tires
GROUP BY vehicleid
but it doesn't select the kilometers associated with the most recent date so it does not work.
Any idea how to make this selection?
There are several ways to get the desired result.
Correlated scalar subquery...
SELECT vt1.*
FROM vehicle_tires vt1
WHERE vt1.recordId = (SELECT vt2.recordId
FROM vehicle_tires vt2
WHERE vt2.vehicleId = vt1.vehicleId
ORDER BY vt2.dateOfTireChange DESC LIMIT 1);
...or derived table...
SELECT vt2.*
FROM vehicle_tires vt2
JOIN (SELECT vt1.vehicleId AS vehicleId,
MAX(vt1.dateOfTireChange) AS maxDateOfTireChange
FROM vehicle_tires vt1
GROUP BY vt1.vehicleId) dt ON vt2.vehicleId = dt.vehicleId
AND vt2.dateOfTireChange = dt.maxDateOfTireChange;
...are two that come to mind.
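A quick way to check the derived-table variant is the Python sketch below. It runs against SQLite through the standard-library sqlite3 module instead of MySQL, and the three sample rows and ISO-formatted text dates are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vehicle_tires (recordId INTEGER PRIMARY KEY, vehicleId INTEGER,
                                dateOfTireChange TEXT,
                                expectedKmBeforeNextChange INTEGER, tireType TEXT);
    INSERT INTO vehicle_tires VALUES
        (1, 1, '2023-03-01', 40000, 'summer'),
        (2, 1, '2023-11-01', 30000, 'winter'),
        (3, 2, '2023-10-15', 35000, 'winter');
""")

# dt holds one row per vehicle with its latest change date; joining back on
# both vehicleId and that date pulls the other columns from the same row.
latest = conn.execute("""
    SELECT vt.vehicleId, vt.dateOfTireChange,
           vt.expectedKmBeforeNextChange, vt.tireType
    FROM vehicle_tires AS vt
    JOIN (SELECT vehicleId, MAX(dateOfTireChange) AS maxDateOfTireChange
          FROM vehicle_tires
          GROUP BY vehicleId) AS dt
      ON vt.vehicleId = dt.vehicleId
     AND vt.dateOfTireChange = dt.maxDateOfTireChange
    ORDER BY vt.vehicleId
""").fetchall()

print(latest)
```

If two rows for the same vehicle share the latest date, this variant returns both, whereas the correlated subquery with LIMIT 1 returns only one of them.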
The reason GROUP BY is not correct when applied to the whole table is that any columns you do not GROUP BY and that are also not the subject of aggregate functions MIN() MAX() AVG() COUNT(), etc., are assumed by the server to be columns that you know to be identical in every row of the groups established by the GROUP BY clause.
If, for example, I'm doing a query like this...
SELECT p.id,
p.full_name,
p.date_of_birth,
COUNT(c.id) AS number_of_children
FROM parent p LEFT JOIN child c ON c.parent_id = p.id
GROUP BY p.id;
The correct way to write this query would be GROUP BY p.id, p.full_name, p.date_of_birth, because none of those columns are part of the aggregate function COUNT().
The MySQL optimization allows you to exclude those columns that you know have to, by definition, be the same on each group from the GROUP BY, and the server will fill those columns with data from any row in the group. Which row is not defined. As you can see, in the example, the parent's full_name would be the same in all rows within a group-by parent.id, and that is a case when this optimization is legitimate. The justification is that it allows the server to have to handle smaller values (fewer bytes) when executing the grouping... but in a query like yours where the ungrouped columns have different values within each group, you get an invalid result, by design.
The SQL_MODE ONLY_FULL_GROUP_BY disables this optimization.

Left Outer Join with Missing Records

I'm selecting data from several tables, but the main idea is that a product may or may not have a discount record associated with it, as either a percent off or dollar off amount. I'm using a left outer join (which may be incorrect) and am getting back the same values for dollar and percent values, regardless as to whether or not the records exist.
The query looks something like:
SELECT Items.ItemID, Items.name, Items.price,
ItemDiscounts.percentOff, ItemDiscounts.dollarOff,
ItemAttributes.ColorName, ItemStuff.StuffID
FROM Items, ItemAttributes, ItemStuff
LEFT OUTER JOIN ItemDiscounts
ON ItemDiscounts.ItemID = ItemID
AND (
ItemDiscounts.percentOff > 0
OR ItemDiscounts.dollarOff > 0
)
WHERE Items.ItemID = ItemAttributes.ItemID
AND ItemStuff.ItemID = Items.ItemID
GROUP BY ItemStuff.StuffID
The weird part is that in all results, percentOff returns "1" and dollarOff returns "0", regardless of whether each item has its own associated discount record. For spits, I changed ItemDiscounts.percentOff > 0 to ItemDiscounts.percentOff > 1; then dollarOff changed to all 2's and percentOff was all 0's.
I'm somewhat baffled on this, so any help would be appreciated.
You have an unqualified reference to ItemID in your ON clause... it's not clear why that's not throwing an "ambiguous column" exception. (Apparently, it's not ambiguous to MySQL, and MySQL is determining which ItemID is being referenced; the odds are good that it's not the one you intended.)
Also, your query includes references to the ItemStuff rowsource, but there is no such rowsource shown in your query.
I also suspect that the behavior of the GROUP BY is giving you a result set that doesn't meet your expectation. (More than likely, right now, it's masking the real problem in your query, which could be a CROSS JOIN operation that you didn't intend.)
I suggest you try your query without the GROUP BY clause, and confirm the resultset is what you would expect absent the GROUP BY clause.
NOTE: Most other relational database engines will throw an exception with a GROUP BY like you show in your query. They (basically) require that every non-aggregate in the SELECT list be included in the GROUP BY. You can get MySQL to behave the same way (with some particular settings of sql_mode.) MySQL is more liberal, but the result set you get back may not conform to your expectation.
NOTE: I don't see how this query is passing a semantics check, and is returning any resultset at all, given the references to a non-existent ItemStuff rowsource.
For improved readability, I recommend that you not use the comma as the join operator, and that you instead use the JOIN keyword. I also recommend that you move the join predicates from the WHERE clause to an ON clause. I also prefer to give an alias to each rowsource, and use that alias to qualify the columns from it.
Given what you show in your query, I'd write (the parts I can make sense of) like this:
SELECT i.ItemID
, i.name
, i.price
, d.percentOff
, d.dollarOff
, a.ColorName
FROM Items i
JOIN ItemAttributes a
ON a.ItemID = i.ItemID
LEFT
JOIN ItemDiscounts d
ON d.ItemID = i.ItemID
AND ( d.percentOff > 0 OR d.dollarOff > 0 )
I've omitted ItemStuff.StuffID from the SELECT list, because I don't see any ItemStuff rowsource.
I also exclude the WHERE clause, because I don't see any ItemStuff rowsource in your query.
-- WHERE ItemStuff.ItemID = i.ItemID
I omit the GROUP BY because, again, I don't see any ItemStuff rowsource in your query, and because the behavior of the GROUP BY is likely not what I expect, but is rather masking a problem in my query.
-- GROUP BY ItemStuff.StuffID
UPDATE:
@Kyle, the fact that your query "timed out" leads me to believe you are generating WAY MORE rows than you expect, as if you have a Cartesian product (every row from one table is being "matched" to every row in some other table; with 10,000 rows in one table and 10,000 rows in the other, that will generate 100,000,000 rows).
I think the GROUP BY clause is masking the real problem.
I recommend that for development, you include the PRIMARY KEY of each table as the leading columns in your result set. I would add some reasonable predicates to the driving table (e.g. i.ItemID IN (2,3,5,7)) to limit the size of the result set, and ORDER BY the primary key... that should help you identify an unintended Cartesian product.
Do you get what you want when you remove these lines from your query?
AND (
ItemDiscounts.percentOff > 0
OR ItemDiscounts.dollarOff > 0
)
Once you specify an absolute value for the possibly-null side of an outer join, your WHERE clause has to account for it.
Try it with the following clause:
AND (
ItemDiscounts.percentOff > 0
OR ItemDiscounts.percentOff is null
OR ItemDiscounts.dollarOff > 0
OR ItemDiscounts.dollarOff is null
)
Additionally, you're specifying a GROUP BY without an aggregate. This makes no sense in most cases. You probably want ORDER BY to sort.
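Where the discount filter is placed matters for an outer join. The Python sketch below, which uses the standard-library sqlite3 module as a stand-in for MySQL with invented sample rows, compares the same predicate in the ON clause and in the WHERE clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Items (ItemID INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE ItemDiscounts (ItemID INTEGER, percentOff INTEGER, dollarOff INTEGER);
    INSERT INTO Items VALUES (1, 'mug'), (2, 'cap'), (3, 'pen');
    -- item 1 has a real discount, item 2 an all-zero discount row, item 3 none
    INSERT INTO ItemDiscounts VALUES (1, 10, 0), (2, 0, 0);
""")

# Filter in ON: every item is kept; discounts that fail the test come back as NULL.
on_rows = conn.execute("""
    SELECT i.ItemID, d.percentOff
    FROM Items AS i
    LEFT JOIN ItemDiscounts AS d
           ON d.ItemID = i.ItemID
          AND (d.percentOff > 0 OR d.dollarOff > 0)
    ORDER BY i.ItemID
""").fetchall()

# Filter in WHERE: rows whose discount columns are NULL or zero are removed,
# so the outer join behaves like an inner join.
where_rows = conn.execute("""
    SELECT i.ItemID, d.percentOff
    FROM Items AS i
    LEFT JOIN ItemDiscounts AS d ON d.ItemID = i.ItemID
    WHERE d.percentOff > 0 OR d.dollarOff > 0
    ORDER BY i.ItemID
""").fetchall()

print(on_rows, where_rows)
```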

MySQL Joins, Group By, and Ordering the Group By Choice

Is it possible to order the GROUP BY chosen results of a MySQL query w/out using a subquery? I'm finding that, with my large dataset, the subquery adds a significant amount of load time to my query.
Here is a similar situation: how to sort order of LEFT JOIN in SQL query?
This is my code that works, but it takes way too long to load:
SELECT tags.contact_id, n.last
FROM tags
LEFT JOIN ( SELECT * FROM names ORDER BY timestamp DESC ) n
ON (n.contact_id=tags.contact_id)
WHERE tags.tag='$tag'
GROUP BY tags.contact_id
ORDER BY n.last ASC;
I can get a fast result doing a simple join w/ a table name, but the "group by" command gives me the first row of the joined table, not the last row.
I'm not really sure what you're trying to do. Here are some of the problems with your query:
selecting n.last, although it is neither in the group by clause, nor an aggregate value. Although MySQL allows this, it's really not a good idea to take advantage of.
needlessly sorting a table before joining, instead of just joining
the subquery isn't really doing anything
I would suggest carefully writing down the desired query results, i.e. "I want the contact id and latest date for each tag" or something similar. It's possible that will lead to a natural, easy-to-write and semantically correct query that is also more efficient than what you showed in the OP.
To answer the question "is it possible to order a GROUP BY query": yes, it's quite easy, and here's an example:
select a, b, sum(c) as `c sum`
from <table_name>
group by a,b
order by `c sum`
You are doing a LEFT JOIN on contact ID, which implies you want all tag contacts REGARDLESS of finding a match in the names table. Is that really the case, or will the tags table ALWAYS have a matching contact ID record in names? Additionally, there is your column "n.last". Is this the person's last name, or the last time something was done (which I would assume is actually the timestamp)?
So, that being said, I would just do a simple direct join
SELECT DISTINCT
t.contact_id,
n.last
FROM
tags t
JOIN names n
ON t.contact_id = n.contact_id
WHERE
t.tag = '$tag'
ORDER BY
n.last ASC