Considering the following query:
SELECT COUNT(table1.someField), COUNT(table2.someField)
FROM table1
INNER JOIN table2 ON table2.id = table1.id
GROUP BY table1.id
I am trying to understand what the difference is (if any) between grouping by table1.id and grouping by table2.id. In short, when inner joining two tables on X = Y, is there any difference between grouping by X and grouping by Y? That's it.
The real world example - pretty straightforward: a table transaction holds transactions information (paid amount, dates etc), and a table transaction_product holds information regarding which products were included in which transaction.
So for example, transaction number 1 could have included products number 1, 2 and 3, and so forth (so the table relation is obviously one-to-many).
The problem: I need to know for each transaction, how much was paid for how many products. This is the query, including both GROUP BY alternatives:
SELECT
`transaction`.id,
SUM(`transaction`.transaction_amount) AS total_amount,
COUNT(`transaction_product`.product_id) AS number_of_products
FROM `transaction`
INNER JOIN `transaction_product` ON `transaction_product`.transaction_id = `transaction`.id
GROUP BY [`transaction`.id [OR] `transaction_product`.transaction_id]
I need to know if there is a difference between the two GROUP BY alternatives. I couldn't find relevant information regarding the GROUP BY behavior in this case in the documentation, therefore any help on clarifying the matter would be much appreciated.
The result of the inner join will be a set of rows with matching transaction IDs, so the set of values that column can have will be the same on both transaction and transaction_product tables.
GROUP BY returns a single row for each distinct value of the grouped column(s), and all rows that share a value are aggregated with the aggregate function you use.
Result: there won't be any difference between the two options you have, because the same rows are grouped by the same criterion; the set of values is the same on both sides.
TL;DR
There is no difference at all.
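As a quick check of this claim, here is a minimal sketch that runs the same inner join twice, grouping once by each side of the join key. It uses Python's sqlite3 module with an in-memory database rather than MySQL, and the sample rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database; the rows below are invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "transaction" (id INTEGER PRIMARY KEY, transaction_amount INTEGER);
CREATE TABLE transaction_product (transaction_id INTEGER, product_id INTEGER);
INSERT INTO "transaction" VALUES (1, 100), (2, 50), (3, 70);
INSERT INTO transaction_product VALUES (1, 1), (1, 2), (1, 3), (2, 1);
""")

query = """
SELECT t.id, COUNT(tp.product_id) AS number_of_products
FROM "transaction" AS t
INNER JOIN transaction_product AS tp ON tp.transaction_id = t.id
GROUP BY {group_column}
ORDER BY t.id
"""

# Group by the left-hand key, then by the right-hand key.
by_left = conn.execute(query.format(group_column="t.id")).fetchall()
by_right = conn.execute(query.format(group_column="tp.transaction_id")).fetchall()

print(by_left)
print(by_right)
```

Both result lists are identical. Transaction 3 appears in neither, because the inner join drops transactions without products before any grouping happens.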
There is no difference whatsoever which id you choose to include in your GROUP BY clause. The total number of rows for each transaction id will be the number of products for that transaction. This query should get what you need:
SELECT
`transaction`.id,
SUM(`transaction`.transaction_amount) AS total_amount,
COUNT(1) AS number_of_products
FROM `transaction`
INNER JOIN `transaction_product` ON `transaction_product`.transaction_id =
`transaction`.id
GROUP BY `transaction`.id
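One thing worth checking in the query above: if transaction_amount is stored once per transaction (an assumption about the schema, based on the question's description), then SUM adds it once for every joined product row. The sketch below, using Python's sqlite3 and made-up rows, puts SUM next to MAX, which is one possible way to read the amount only once per transaction.

```python
import sqlite3

# In-memory SQLite database; the rows below are invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "transaction" (id INTEGER PRIMARY KEY, transaction_amount INTEGER);
CREATE TABLE transaction_product (transaction_id INTEGER, product_id INTEGER);
INSERT INTO "transaction" VALUES (1, 100), (2, 50);
INSERT INTO transaction_product VALUES (1, 1), (1, 2), (1, 3), (2, 1);
""")

rows = conn.execute("""
SELECT t.id,
       SUM(t.transaction_amount) AS summed_amount,   -- repeated once per product row
       MAX(t.transaction_amount) AS amount_once,     -- the stored amount, read once
       COUNT(*) AS number_of_products
FROM "transaction" AS t
INNER JOIN transaction_product AS tp ON tp.transaction_id = t.id
GROUP BY t.id
ORDER BY t.id
""").fetchall()

for row in rows:
    print(row)
```

With this data, transaction 1 (amount 100, three products) reports a summed amount of 300 but an amount of 100 when read once.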
Related
I have two tables with a one-to-many relationship. One is the header and the other is the detail.
Header Table Columns are State, Item, Bill, Quantity
Detail Table Columns are Header_ID, Allocated To, Allocated Quantity
I need to get the sum of the Header Quantity and the sum of the Detail Quantity using a left join, as the detail may not always exist.
Here is how it looks after joining (Before grouping)
My required result after grouping should be
State, Item, Total Quantity (Header), Total Allocated (Detail Quantity).
But my problem is that I am doing a left join, so the header value is repeated, and when I sum, the duplicate values are summed up as well.
Here is the query I tried:
SELECT
a.State,
a.Item,
b.Allocated,
sum(a.Quantity) AS Header_Total, -- Should take unique/single
sum(b.Quantity) AS Detail_Total
FROM table1 a
LEFT JOIN table2 b ON a.ID = b.Header_ID
GROUP BY
a.State,
a.Item,
b.Allocated;
Please help in querying this to get the desired result.
JOIN (including LEFT JOIN) creates a result set containing all pairs of rows allowed by the ON clause. So your SUM(a.Quantity) expression adds up multiple copies of that value whenever a header has more than one detail row, and so generates wrong results in a lot of cases.
This is sometimes called a combinatorial explosion or a cardinality error.
Fix: take off the SUM and add a.Quantity to the GROUP BY.
SELECT
a.State,
a.Item,
b.Allocated,
a.Quantity AS Header_Total,
sum(b.Quantity) AS Detail_Total
FROM table1 a
LEFT JOIN table2 b ON a.ID = b.Header_ID
GROUP BY
a.State,
a.Item,
b.Allocated,
a.Quantity;
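To see the effect on a small data set, here is a sketch in Python's sqlite3 with invented header and detail rows. It leaves b.Allocated out of the SELECT and GROUP BY so that each header yields one row (an adaptation for the demo, not part of the answer's query) and compares SUM(a.Quantity) with a.Quantity moved into the GROUP BY.

```python
import sqlite3

# In-memory SQLite database; table and column names follow the question,
# the rows are invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (ID INTEGER PRIMARY KEY, State TEXT, Item TEXT, Quantity INTEGER);
CREATE TABLE table2 (Header_ID INTEGER, Allocated TEXT, Quantity INTEGER);
INSERT INTO table1 VALUES (1, 'NY', 'Pen', 10), (2, 'NY', 'Ink', 5);
INSERT INTO table2 VALUES (1, 'Ann', 4), (1, 'Bob', 3);
""")

# Header quantity is summed once per matching detail row, so it is inflated.
inflated = conn.execute("""
SELECT a.State, a.Item, SUM(a.Quantity), SUM(b.Quantity)
FROM table1 a
LEFT JOIN table2 b ON a.ID = b.Header_ID
GROUP BY a.State, a.Item
ORDER BY a.Item
""").fetchall()

# Header quantity is a grouping column, so it is reported as stored.
fixed = conn.execute("""
SELECT a.State, a.Item, a.Quantity, SUM(b.Quantity)
FROM table1 a
LEFT JOIN table2 b ON a.ID = b.Header_ID
GROUP BY a.State, a.Item, a.Quantity
ORDER BY a.Item
""").fetchall()

print(inflated)
print(fixed)
```

The 'Pen' header has quantity 10 and two detail rows, so the first query reports 20 while the second reports 10. The 'Ink' header has no detail rows, so its detail total comes back as NULL (None in Python).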
Let's say I have a query:
select product_id, price, price_day
from products
where price>10
and I want to join the result of this query with itself (for example, to get a product's price and its price on the previous day in the same row).
I can do this:
select * from
(
select product_id, price, price_day
from products
where price>10
) as r1
join
(
select product_id, price, price_day
from products
where price>10
) as r2
on r1.product_id=r2.product_id and r1.price_day=r2.price_day-1
but as you can see I am copying the original query, naming it a different name just to join its result with itself.
Another option is to create a temp table but then I have to remember to remove it.
Is there a more elegant way to join the result of a query with itself?
A self join will help:
select a.product_ID,a.price
,a.price_day
,b.price as prevdayprice
,b.price_day as prevday
from products a
inner join products b
on a.product_ID=b.product_ID and a.price_day = b.price_day+1
where a.price >10
You could do a number of things. A few options:
Just let MySQL handle the optimization.
This will likely work fine until you hit many rows.
Make a view for your base query and use that.
This could increase performance, but mostly it increases readability (if done right).
Use a (non-temporary) table and insert your initial rows there. (Unfortunately, you cannot refer to a temporary table more than once in a query.)
This will likely be more expensive performance-wise until a certain number of rows is reached.
Depending on how important performance is for your situation and how many rows you need to work with, the "best" choice will change.
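The view option can be sketched as follows. This uses Python's sqlite3 instead of MySQL, the view name expensive_products is hypothetical, price_day is assumed to be an integer day number, and the rows are invented.

```python
import sqlite3

# In-memory SQLite database; the rows below are invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_id INTEGER, price INTEGER, price_day INTEGER);
INSERT INTO products VALUES
  (1, 12, 1), (1, 15, 2), (1, 8, 3),
  (2, 20, 1), (2, 25, 2);

-- The base query is written once, as a view (hypothetical name).
CREATE VIEW expensive_products AS
  SELECT product_id, price, price_day FROM products WHERE price > 10;
""")

# The view is joined with itself, using the same ON clause as in the question.
rows = conn.execute("""
SELECT r1.product_id, r1.price_day, r1.price, r2.price AS next_day_price
FROM expensive_products AS r1
JOIN expensive_products AS r2
  ON r1.product_id = r2.product_id AND r1.price_day = r2.price_day - 1
ORDER BY r1.product_id, r1.price_day
""").fetchall()

for row in rows:
    print(row)
```

Because the price filter lives in the view, it applies to both sides of the join: the day-3 row of product 1 (price 8) is excluded, so day 2 of product 1 has no partner row.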
Just to get duplicates in the same row?
select product_id as id1, price as price1, price_day as priceday1, product_id as id2, price as price2, price_day as priceday2
from products
where price>10
I have two tables, clients and transactions, and I need a query over both that selects all clients together with the total of their transactions.
My problem is that when I query the two tables with the condition that the transaction must carry the client's id, it shows only those clients that have a record in the transactions table. I want it to display all clients, even those without any transactions (it can display zero instead of the sum of transactions).
I know that, because the condition refers to the transactions table, the query doesn't select clients that don't meet it. But how can I select all clients with the sum of their transactions, or zero if they don't have any?
this is a short view of tables (only those columns I used in query)
ID Name Company Phone //clients table
ID Client_id Incoming ... //transaction table
Thank you in advance and sorry for my bad english
In addition, you can also do this with a correlated subquery:
SELECT c.*,
(select coalesce(sum(t.incoming) - sum(t.outgoing), 0)
from transactions t
where t.client_id = c.id
) as total
from clients c;
Under some circumstances, this could have better performance.
SELECT c.Name, count(t.ID)
FROM clients c
left join transactions t on c.ID = t.Client_id
group by c.ID, c.Name
You could use a left join, something like:
SELECT *
FROM clients
LEFT JOIN transactions ON clients.ID = transactions.Client_id
You would get all clients. For clients without transactions, the transaction columns come back as NULL, so you'll have to convert those to 0.
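A minimal sketch of that idea, combining the left join with an aggregate and COALESCE to turn the NULL into 0. It runs on SQLite through Python's sqlite3; the column names follow the question, and the rows are invented.

```python
import sqlite3

# In-memory SQLite database; the rows below are invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE clients (ID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE transactions (ID INTEGER PRIMARY KEY, Client_id INTEGER, Incoming INTEGER);
INSERT INTO clients VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO transactions (Client_id, Incoming) VALUES (1, 30), (1, 20);
""")

rows = conn.execute("""
SELECT c.ID, c.Name, COALESCE(SUM(t.Incoming), 0) AS total
FROM clients c
LEFT JOIN transactions t ON t.Client_id = c.ID
GROUP BY c.ID, c.Name      -- group by the client, not by the transaction column
ORDER BY c.ID
""").fetchall()

for row in rows:
    print(row)
```

Bob has no transactions, so SUM over his single all-NULL joined row is NULL, which COALESCE replaces with 0. Grouping by the clients columns keeps one row per client even when several clients have no transactions.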
I have an SQL query that needs to perform multiple inner joins, as follows:
SELECT DISTINCT adv.Email, adv.Credit, c.credit_id AS creditId, c.creditName AS creditName, a.Ad_id AS adId, a.adName
FROM placementlist pl
INNER JOIN
(SELECT Ad_id, List_id FROM placements) AS p
ON pl.List_id = p.List_id
INNER JOIN
(SELECT Ad_id, Name AS adName, credit_id FROM ad) AS a
ON ...
(few more inner joins)
My question is the following: How can I optimize this query? I was under the impression that, even though the way I currently query the database creates small temporary tables (the inner SELECT statements), it would still be preferable to performing an inner join on the unaltered tables, as they could have about 10,000 to 100,000 entries (not millions). However, I was told that this is not the best way to go about it, and I did not have the opportunity to ask what the recommended approach would be.
What would be the best approach here?
Using derived tables such as
INNER JOIN (SELECT Ad_id, List_id FROM placements) AS p
is not recommended. Let the DBMS work out for itself which values it needs from
INNER JOIN placements AS p
instead of telling it (again) by more or less forcing it to create a view on the table with only those two values. (And joining the plain table name is much more readable, too.)
With SQL you mainly say what you want to see, not how this is going to be achieved. (Well, of course this is just a rule of thumb.) So if no other columns except Ad_id and List_id are used from table placements, the dbms will find its best way to handle this. Don't try to make it use your way.
The same is true of the IN clause, by the way, where you often see WHERE col IN (SELECT DISTINCT colx FROM ...) instead of simply WHERE col IN (SELECT colx FROM ...). This does exactly the same, but with DISTINCT you tell the dbms "make your subquery's rows distinct before looking for col". But why would you want to force it to do so? Why not have it use just the method the dbms finds most appropriate?
Back to derived tables: Use them when they really do something, especially aggregations, or when they make your query more readable.
Moreover,
SELECT DISTINCT adv.Email, adv.Credit, ...
doesn't look too good either. Yes, sometimes you need SELECT DISTINCT, but usually you wouldn't. Most often it is just a sign that you haven't thought your query through.
An example: you want to select clients that bought product X. In SQL you would say: where a purchase of X EXISTS for the client. Or: where the client is IN the set of the X purchasers.
select * from clients c where exists
(select * from purchases p where p.clientid = c.clientid and product = 'X');
Or
select * from clients where clientid in
(select clientid from purchases where product = 'X');
You don't say: Give me all combinations of clients and X purchases and then boil that down so I just get each client once.
select distinct c.*
from clients c
join purchases p on p.clientid = c.clientid and product = 'X';
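The three formulations can be compared on a small data set. The sketch below uses Python's sqlite3 with invented clients and purchases, and runs the join without DISTINCT to show the duplicates it would otherwise have to remove.

```python
import sqlite3

# In-memory SQLite database; the rows below are invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE clients (clientid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE purchases (clientid INTEGER, product TEXT);
INSERT INTO clients VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
INSERT INTO purchases VALUES (1, 'X'), (1, 'X'), (2, 'Y'), (3, 'X');
""")

with_exists = conn.execute("""
select c.clientid from clients c
where exists (select * from purchases p
              where p.clientid = c.clientid and p.product = 'X')
order by c.clientid
""").fetchall()

with_in = conn.execute("""
select clientid from clients
where clientid in (select clientid from purchases where product = 'X')
order by clientid
""").fetchall()

# The join without DISTINCT: one row per matching purchase, not per client.
plain_join = conn.execute("""
select c.clientid from clients c
join purchases p on p.clientid = c.clientid and p.product = 'X'
order by c.clientid
""").fetchall()

print(with_exists, with_in, plain_join)
```

EXISTS and IN return each qualifying client once. The join returns client 1 twice because that client bought X twice, which is the duplication DISTINCT would then be asked to clean up.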
Yes, it is very easy to just join all tables needed and then just list the columns to select and then just put DISTINCT in front. But it makes the query kind of blurry, because you don't write the query as you would word the task. And it can make things difficult when it comes to aggregations. The following query is wrong, because you multiply money earned with the number of money-spent records and vice versa.
select
sum(money_spent.value),
sum(money_earned.value)
from user
join money_spent on money_spent.userid = user.userid
join money_earned on money_earned.userid = user.userid;
And the following may look correct, but is still incorrect (it only works when the values happen to be unique):
select
sum(distinct money_spent.value),
sum(distinct money_earned.value)
from user
join money_spent on money_spent.userid = user.userid
join money_earned on money_earned.userid = user.userid;
Again: You would not say: "I want to combine each purchase with each earning and then ...". You would say: "I want the sum of money spent and the sum of money earned per user". So you are not dealing with single purchases or earnings, but with their sums. As in
select
(select sum(value) from money_spent where money_spent.userid = user.userid) as spent,
(select sum(value) from money_earned where money_earned.userid = user.userid) as earned
from user;
Or:
select
spent.total,
earned.total
from user
join (select userid, sum(value) as total from money_spent group by userid) spent
on spent.userid = user.userid
join (select userid, sum(value) as total from money_earned group by userid) earned
on earned.userid = user.userid;
So you see, this is where derived tables come into play.
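To make the difference concrete, here is a sketch that runs the plain join, the SUM(DISTINCT ...) variant and the derived-table variant against the same rows. It uses Python's sqlite3, quotes the table name "user" defensively, and the amounts are invented so that some values repeat.

```python
import sqlite3

# In-memory SQLite database; one user with 2 spendings and 3 earnings (invented).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "user" (userid INTEGER PRIMARY KEY);
CREATE TABLE money_spent (userid INTEGER, value INTEGER);
CREATE TABLE money_earned (userid INTEGER, value INTEGER);
INSERT INTO "user" VALUES (1);
INSERT INTO money_spent VALUES (1, 10), (1, 10);
INSERT INTO money_earned VALUES (1, 100), (1, 50), (1, 50);
""")

joins = """
FROM "user" u
JOIN money_spent ON money_spent.userid = u.userid
JOIN money_earned ON money_earned.userid = u.userid
"""

# Every spending row is paired with every earning row (2 x 3 = 6 rows).
naive = conn.execute(
    "SELECT SUM(money_spent.value), SUM(money_earned.value)" + joins
).fetchone()

# DISTINCT removes the join duplicates, but also the genuinely repeated amounts.
with_distinct = conn.execute(
    "SELECT SUM(DISTINCT money_spent.value), SUM(DISTINCT money_earned.value)" + joins
).fetchone()

# Aggregate each table first, then join the per-user totals.
derived = conn.execute("""
SELECT spent.total, earned.total
FROM "user" u
JOIN (SELECT userid, SUM(value) AS total FROM money_spent GROUP BY userid) spent
  ON spent.userid = u.userid
JOIN (SELECT userid, SUM(value) AS total FROM money_earned GROUP BY userid) earned
  ON earned.userid = u.userid
""").fetchone()

print(naive, with_distinct, derived)
```

The true totals are 20 spent and 200 earned. Only the derived-table query returns them; the plain join over-counts and SUM(DISTINCT ...) under-counts because two spendings of 10 and two earnings of 50 collapse into one each.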
Following my recent question Select information from last item and join to the total amount, I am having some memory problems while generating tables.
I have two tables sales1 and sales2 like this:
id | dates | customer | sale
With this table definition:
CREATE TABLE sales (
id int auto_increment primary key,
dates date,
customer int,
sale int
);
sales1 and sales2 have the same definition, but sales2 has sale=-1 in every row. A customer can be in none, one or both tables. Both tables have around 300,000 records and many more fields than indicated here (around 50 fields). They are InnoDB.
I want to select, for each customer:
number of purchases
last purchase value
total amount of purchases, when it has a positive value
The query I am using is:
SELECT a.customer, count(a.sale), max_sale
FROM sales a
INNER JOIN (SELECT customer, sale max_sale
from sales x where dates = (select max(dates)
from sales y
where x.customer = y.customer
and y.sale > 0
)
)b
ON a.customer = b.customer
GROUP BY a.customer, max_sale;
The problem is:
I have to get the results, that I need for certain calculations, separated for dates: information on year 2012, information on year 2013, but also information from all the years together.
Whenever I do just one year, it takes about 2-3 minutes to store all the information.
But when I try to gather information from all the years, the database crashes and I get messages like:
InternalError: (InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction')
It seems that joining such huge tables is too much for the database. When I explain the query, almost all of the time goes to creating the tmp table.
I thought of splitting the data gathering into quarters: we get the results for every three months and then join and sort them. But I guess this final join and sort will again be too much for the database.
So, what would you experts recommend to optimize these queries as long as I cannot change the tables structure?
300k rows is not a huge table. We frequently see 300 million row tables.
The biggest problem with your query is that you're using a correlated subquery, so it has to re-execute the subquery for each row in the outer query.
It's often the case that you don't need to do all your work in one SQL statement. There are advantages to breaking it up into several simpler SQL statements:
Easier to code.
Easier to optimize.
Easier to debug.
Easier to read.
Easier to maintain if/when you have to implement new requirements.
Number of Purchases
SELECT customer, COUNT(sale) AS number_of_purchases
FROM sales
GROUP BY customer;
An index on sales(customer,sale) would be best for this query.
Last Purchase Value
This is the greatest-n-per-group problem that comes up frequently.
SELECT a.customer, a.sale as max_sale
FROM sales a
LEFT OUTER JOIN sales b
ON a.customer=b.customer AND a.dates < b.dates
WHERE b.customer IS NULL;
In other words, try to match row a to a hypothetical row b that has the same customer and a greater date. If no such row is found, then a must have the greatest date for that customer.
An index on sales(customer,dates,sale) would be best for this query.
If you might have more than one sale for a customer on that greatest date, this query will return more than one row per customer. You'd need to find another column to break the tie. If you use an auto-increment primary key, it's suitable as a tie breaker because it's guaranteed to be unique and it tends to increase chronologically.
SELECT a.customer, a.sale as max_sale
FROM sales a
LEFT OUTER JOIN sales b
ON a.customer=b.customer AND (a.dates < b.dates OR a.dates = b.dates and a.id < b.id)
WHERE b.customer IS NULL;
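The effect of the tie breaker can be seen on a few rows. This sketch uses Python's sqlite3, stores dates as ISO-formatted text (an assumption made so that string comparison matches date order in SQLite), and uses invented sales in which customer 1 has two sales on the latest date.

```python
import sqlite3

# In-memory SQLite database; the rows below are invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  dates TEXT,          -- ISO 'YYYY-MM-DD' text, so text order equals date order
  customer INTEGER,
  sale INTEGER
);
INSERT INTO sales (dates, customer, sale) VALUES
  ('2013-01-01', 1, 10),
  ('2013-02-01', 1, 20),
  ('2013-02-01', 1, 30),
  ('2013-01-15', 2, 5);
""")

# Date only: both rows on the latest date survive the anti-join.
without_tiebreak = conn.execute("""
SELECT a.customer, a.sale AS max_sale
FROM sales a
LEFT OUTER JOIN sales b
  ON a.customer = b.customer AND a.dates < b.dates
WHERE b.customer IS NULL
ORDER BY a.customer, a.sale
""").fetchall()

# Date plus id: among equal dates, only the row with the highest id survives.
with_tiebreak = conn.execute("""
SELECT a.customer, a.sale AS max_sale
FROM sales a
LEFT OUTER JOIN sales b
  ON a.customer = b.customer
 AND (a.dates < b.dates OR (a.dates = b.dates AND a.id < b.id))
WHERE b.customer IS NULL
ORDER BY a.customer
""").fetchall()

print(without_tiebreak)
print(with_tiebreak)
```

Without the tie breaker customer 1 appears twice; with it, each customer appears exactly once.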
Total Amount of Purchases, When It Has a Positive Value
SELECT customer, SUM(sale) AS total_purchases
FROM sales
WHERE sale > 0
GROUP BY customer;
An index on sales(customer,sale) would be best for this query.
You should consider using NULL to signify a missing sale value instead of -1. Aggregate functions like SUM() and COUNT() ignore NULLs, so you don't have to use a WHERE clause to exclude rows with sale < 0.
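A short sketch of how the aggregates treat NULL, run on SQLite through Python's sqlite3 with invented rows:

```python
import sqlite3

# In-memory SQLite database; NULL stands for a missing sale value (invented rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (customer INTEGER, sale INTEGER);
INSERT INTO sales VALUES (1, 10), (1, NULL), (1, 5), (2, NULL);
""")

rows = conn.execute("""
SELECT customer, COUNT(sale) AS number_of_purchases, SUM(sale) AS total_purchases
FROM sales
GROUP BY customer
ORDER BY customer
""").fetchall()

for row in rows:
    print(row)
```

Customer 1's NULL row is skipped by both COUNT(sale) and SUM(sale) without any WHERE clause. Note that for customer 2, whose only value is NULL, COUNT returns 0 but SUM returns NULL (None in Python), so wrap it in COALESCE if a 0 is wanted.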
Re: your comment
What I have now is a table with fields year, quarter, total_sale (regarding to the pair (year,quarter)) and sale. What I want to gather is information regarding certain period: this quarter, quarters, year 2011... Info has to be splitted in top customers, ones with bigger sales, etc. Would it be possible to get the last purchase value from customers with total_purchases bigger than 5?
Top Five Customers for Q4 2012
SELECT customer, SUM(sale) AS total_purchases
FROM sales
WHERE (year, quarter) = (2012, 4) AND sale > 0
GROUP BY customer
ORDER BY total_purchases DESC
LIMIT 5;
I'd want to test it against real data, but I believe an index on sales(year, quarter, customer, sale) would be best for this query.
Last Purchase for Customers with Total Purchases > 5
SELECT a.customer, a.sale as max_sale
FROM sales a
INNER JOIN sales c ON a.customer=c.customer
LEFT OUTER JOIN sales b
ON a.customer=b.customer AND (a.dates < b.dates OR a.dates = b.dates and a.id < b.id)
WHERE b.customer IS NULL
GROUP BY a.id
HAVING COUNT(*) > 5;
As in the other greatest-n-per-group query above, an index on sales(customer,dates,sale) would be best for this query. It probably can't optimize both the join and the group by, so this will incur a temporary table. But at least it will only do one temporary table instead of many.
These queries are complex enough. You shouldn't try to write a single SQL query that can give all of these results. Remember the classic quote from Brian Kernighan:
Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?
I think you should try adding an index on sales(customer, dates). The subquery is probably the performance bottleneck.
You can make this puppy scream. Dump the whole inner join query. Really. This is a trick virtually no one seems to know about.
Assuming dates is a datetime, convert it to a sortable string, concatenate the values you want, max (or min), substring, cast. You may need to adjust the date convert function (this one works in MS-SQL), but this idea will work anywhere:
SELECT customer, count(sale), max_sale = cast(substring(max(convert(char(19), dates, 120) + str(sale, 12, 2)), 20, 12) as numeric(12, 2))
FROM sales a
group by customer
Voilà. If you need more result columns, do:
SELECT yourkey
, maxval = left(val, N1) --you often won't need this
, result1 = substring(val, N1+1, N2)
, result2 = substring(val, N1+N2+1, N3) --etc. for more values
FROM ( SELECT yourkey, val = max(cast(maxval as char(N1))
+ cast(resultCol1 as char(N2))
+ cast(resultCol2 as char(N3)) )
FROM yourtable GROUP BY yourkey ) t
Be sure that you have fixed lengths for all but the last field. This takes a little work to get your head around, but is very learnable and repeatable. It will work on any database engine, and even if you have rank functions, this will often significantly outperform them.
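As a portability check of the idea, here is a sketch of the same trick on SQLite via Python's sqlite3. It swaps the MS-SQL functions for SQLite's printf and substr, assumes dates are stored as 10-character ISO text and that sale is a non-negative integer of at most 10 digits, and uses invented rows.

```python
import sqlite3

# In-memory SQLite database; the rows below are invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (id INTEGER PRIMARY KEY, dates TEXT, customer INTEGER, sale INTEGER);
INSERT INTO sales (dates, customer, sale) VALUES
  ('2013-01-01', 1, 10),
  ('2013-02-01', 1, 20),
  ('2012-12-31', 2, 99),
  ('2013-01-15', 2, 5);
""")

rows = conn.execute("""
SELECT customer,
       COUNT(sale) AS purchases,
       -- Build 'YYYY-MM-DD' + zero-padded sale, take the MAX string per customer,
       -- then cut the sale back out (it starts at character 11).
       CAST(substr(MAX(dates || printf('%010d', sale)), 11) AS INTEGER) AS last_sale
FROM sales
GROUP BY customer
ORDER BY customer
""").fetchall()

for row in rows:
    print(row)
```

The fixed-width date prefix decides which string is the maximum, so the value cut out of it is the sale from the latest date, not the largest sale: customer 2 gets 5, not 99.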
More on this very common challenge here.