Here is my data structure:
[screenshot of the payments table: http://luvboy.co.cc/images/db.JPG]
When I try this SQL:
select rec_id, customer_id, dc_number, balance
from payments
where customer_id='IHS050018'
group by dc_number
order by rec_id desc;
something is wrong somewhere, but I don't know what.
I need:
rec_id   customer_id   dc_number   balance
2        IHS050018     DC3         -1
3        IHS050018     52          600
I want the most recent balance for the customer, with respect to each dc_number.
Thanks!
There are essentially two ways to get this:
select p.rec_id, p.customer_id, p.dc_number, p.balance
from payments p
where p.rec_id = (
    -- "=" instead of IN: MySQL rejects LIMIT inside IN/ALL/ANY/SOME subqueries
    select s.rec_id
    from payments s
    where s.customer_id = 'IHS050018' and s.dc_number = p.dc_number
    order by s.rec_id desc
    limit 1);
Also, if you want the last balance for each customer, you might do:
select p.rec_id, p.customer_id, p.dc_number, p.balance
from payments p
where p.rec_id = (
    select s.rec_id
    from payments s
    where s.customer_id = p.customer_id and s.dc_number = p.dc_number
    order by s.rec_id desc
    limit 1);
What I consider essentially another way utilizes the fact that select rec_id with order by rec_id desc and limit 1 is equivalent to select max(rec_id) with an appropriate group by. In full:
select p.rec_id, p.customer_id, p.dc_number, p.balance
from payments p
where p.rec_id IN (
select max(s.rec_id)
from payments s
group by s.customer_id, s.dc_number
);
This should be faster (if you want the last balance for every customer), since max is normally less expensive than a sort (with indexes it might be the same).
Also, when written like this, the subquery is not correlated (it need not be run for every row of the outer query), which means it will be run only once, and the whole query can be rewritten as a join.
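For instance, a minimal sketch of that join rewrite (the m and max_rec_id aliases are mine):
select p.rec_id, p.customer_id, p.dc_number, p.balance
from payments p
join (
    -- one row per (customer_id, dc_number): the latest rec_id
    select max(s.rec_id) as max_rec_id
    from payments s
    group by s.customer_id, s.dc_number
) m on m.max_rec_id = p.rec_id;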
Also notice that it might be beneficial to write it as a correlated query (by adding where s.customer_id = p.customer_id and s.dc_number = p.dc_number to the inner query), depending on the selectivity of the outer query. This might improve performance if you look up the last balance of only one or a few rows.
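Concretely, that correlated variant would look something like this:
select p.rec_id, p.customer_id, p.dc_number, p.balance
from payments p
where p.rec_id IN (
    select max(s.rec_id)
    from payments s
    -- correlation: the inner query now depends on the outer row
    where s.customer_id = p.customer_id and s.dc_number = p.dc_number
);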
I don't think there is a good way to do this in SQL without window functions (like those in Postgres 8.4). You probably have to iterate over the dataset in your code and get the recent balances that way.
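For reference, on a database that does have window functions (Postgres 8.4+, or MySQL 8.0+), a sketch of this approach could be:
select rec_id, customer_id, dc_number, balance
from (
    select p.*,
           -- newest rec_id first within each dc_number
           row_number() over (partition by dc_number order by rec_id desc) as rn
    from payments p
    where customer_id = 'IHS050018'
) t
where rn = 1;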
ORDER comes before GROUP:
select rec_id, customer_id, dc_number, balance
from payments
where customer_id='IHS050018'
order by rec_id desc
group by dc_number
I have a main table named tblorder.
It contains CUID(Customer ID), CuName(Customer Name) and OrDate(Order Date) that I care about.
It is currently ordered by date in ascending order (e.g. 2001 before 2002).
Objective:
Trying to retrieve the most recent 1 million DISTINCT customers' CUID and CuName, and insert them into a temp table (#Recent1M) for later joins.
So I:
Would need ORDER BY OrDate DESC to flip the dates and retrieve the most recent 1 million customers
Only want the first 1 million DISTINCT customers' information (CUID, CuName)
I know the following code is not correct, but it shows the main idea; I just can't figure out the correct syntax. So far, a WHILE loop with SELECT INTO seems the most plausible solution.
SQL platform: SQL Server (SSMS)
Declare #DC integer
Set #DC = Count(distinct(CUID)) from #Recent1M))
While (#DC <1000000)
Begin
Select CuID,CuName into #Recent1MCus from tblorder
End
Thank you very much, I appreciate any help!
TOP 1000000 is the way to go, but you're going to need an ORDER BY clause or you will get arbitrary results. In your case, you mentioned that you wanted the most recent ones, so:
ORDER BY OrDate DESC
Also, you might consider using GROUP BY rather than DISTINCT. I think it looks cleaner, and it keeps the select list a select list, so you have the option to include whatever else you might want (as I took the liberty of doing). Notice that, because of the grouping, the ORDER BY now uses MAX(OrDate), since customers can presumably have multiple OrDate values and we are interested in the most recent. So:
select top 1000000
    cuid,
    cuname,
    sum(order_value) as ca_ching,
    count(distinct order_id) as order_count
into #Recent1MCus
from tblorder
group by cuid, cuname
order by max(ordate) desc
I hope this helps.
Wouldn't you just do this?
select distinct top 1000000 cuid, cuname
into #Recent1MCus
from tblorder;
If the names might not be distinct, you can do:
select top 1000000 cuid, cuname
into #Recent1MCus
from (select o.*, row_number() over (partition by cuid order by ordate desc) as seqnum
from tblorder o
) o
where seqnum = 1;
Use DISTINCT and ORDER BY <colname> DESC to get the latest unique records.
Try this SQL query:
SELECT DISTINCT top 1000000
cuid,
cuname
INTO #Recent1MCus
FROM tblorder
ORDER BY OrDate DESC;
I have a query.
SELECT * FROM users LEFT JOIN ranks ON ranks.minPosts <= users.postCount
This returns a row every time there is a match. By using GROUP BY users.id I get one row per individual id.
However, when they group, I only get the first row. I would instead like the row with the highest value of ranks.minPosts.
Is there a way to do this? Also, would it be faster (fewer resources) to just use two different queries?
Assuming there is only one column in ranks that you want, you can do this using a correlated subquery:
SELECT u.*,
(select r.minPosts
from ranks r
where r.minPosts <= u.PostCount
order by minPosts desc
limit 1
) as minPosts
FROM users u;
If you need the entire row from ranks, then join it back in:
SELECT ur.*, r.*
FROM (SELECT u.*,
(select r.minPosts
from ranks r
where r.minPosts <= u.PostCount
order by minPosts desc
limit 1
) as minPosts
FROM users u
) ur join
ranks r
on ur.minPosts = r.minPosts;
(The * is for convenience; you should list out the columns you want.)
Because you're using MySQL, this will work:
SELECT * FROM (
SELECT *, users.id user_id
FROM users
LEFT JOIN ranks ON ranks.minPosts <= users.postCount
ORDER BY ranks.minPosts DESC
) x
GROUP BY user_id
MySQL always returns the first row encountered for each unique group, so if you first order the data and then use the non-standard grouping behaviour, you'll get the row you want.
Disclaimer:
Although this works reliably in practice, the MySQL documentation says not to rely on it. If you use this convenient approach (which will reliably pass any test you can write), you should consider that it is not recommended by MySQL and that later releases of MySQL may not continue to behave this way.
What we'd really like to do would be to order the rows by ranks.minPosts before the group by. Unfortunately MySQL doesn't support that without using a subquery of some form.
If the ranks are already ordered by their ids, then you can extract the id by selecting MAX(ranks.id); if they're not, you can still get the highest ranks.minPosts by selecting MAX(ranks.minPosts). However, it would be nice to be able to get the entire record, so I guess you're left with the subquery solution, which is as follows:
SELECT <fields>
FROM users
LEFT JOIN (SELECT * FROM ranks ORDER BY minPosts DESC) as r
    ON r.minPosts <= users.postCount
GROUP BY users.id
I am trying to get a list of possible customers along with the sum of their order history (ltv)
Without the order by, this query runs in under a second. With the order by, the query takes over 90 seconds.
SELECT a.customerid,a.firstname,a.lastname,Orders.ltv
FROM customers a
LEFT JOIN (
SELECT customerid,
SUM(amount) as ltv
FROM orders
GROUP BY customerid) Orders
ON Orders.customerid=a.customerid
ORDER BY
Orders.ltv DESC
LIMIT 0,10
Any ideas how this could be sped up?
EDIT: I guess I cleaned up the query a little too much. The actual query is a little more complicated than this version. Other data is selected from the customers table, and can be sorted against as well.
Without the actual schema it is a bit hard to know how the data is related, but I guess this query should be equivalent and more performant:
SELECT a.customerid, coalesce(sum(o.amount), 0) TotalLtv
FROM customers a
LEFT JOIN orders o ON a.customerid = o.customerid
GROUP BY a.customerid
ORDER BY TotalLtv DESC
LIMIT 10
The coalesce will make sure you return 0 for the customers without orders.
As #ypercube made me notice, an index on amount won't help either. You could give it a try to:
ALTER TABLE orders ADD INDEX(customerid, amount)
After your question update
If you need to add more fields that functionally depend on a.customerid to the select clause, you can use MySQL's non-standard group by extension. This will result in better performance than grouping by a.customerid, a.firstname, a.lastname:
SELECT a.customerid, a.firstname, a.lastname, coalesce(sum(o.amount), 0) TotalLtv
FROM customers a
LEFT JOIN orders o ON a.customerid = o.customerid
GROUP BY a.customerid
ORDER BY TotalLtv DESC
LIMIT 10
A few things here. First, it doesn't appear that you need to join the customers table at all, since you are only using it for the customerid, which already exists in the orders table. If you have more than 10 customer ids with corresponding amounts, you will never even need the list of customer ids without amounts that you would get from the LEFT JOIN against customers. As such, you should be able to reduce your query to this:
SELECT customerid, SUM(amount) AS ltv
FROM orders
GROUP BY customerid
ORDER BY ltv DESC LIMIT 0,10
You would need an index on customerid. Unfortunately, the sort is on a calculated field, so there is not a lot you can do to speed this up from that point.
I see the updated question. Since you do need additional fields from customers, I will revise my answer to include the customers table:
SELECT c.customerid, c.firstname, c.lastname, coalesce(o.ltv, 0) AS total
FROM customers AS c
LEFT JOIN (
SELECT customerid, SUM(amount) as ltv
FROM orders
GROUP BY customerid
ORDER BY ltv DESC LIMIT 0,10) AS o
ON c.customerid = o.customerid
Note that I am joining on a sub-selected table, as you were doing in your original query; however, I have performed the sort and limit inside the sub-select, so you don't have to sort all the records, including customers without any entries in the orders table.
Two things. First, don't use an inner query; MySQL does allow ORDER BY on a projection alias. Second, you should get a considerable improvement by having a B-TREE index on the composed key (customerid, amount). Then the engine will be able to execute this query by a simple traversal of the index, without fetching any row data.
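A sketch of both suggestions, using the tables from the question (the index name is illustrative):
-- composite B-TREE index covering the grouped and summed columns
ALTER TABLE orders ADD INDEX idx_customerid_amount (customerid, amount);

-- no inner query: ORDER BY references the projection alias directly
SELECT customerid, SUM(amount) AS ltv
FROM orders
GROUP BY customerid
ORDER BY ltv DESC
LIMIT 10;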
Following my recent question Select information from last item and join to the total amount, I am having some memory problems while generating tables.
I have two tables sales1 and sales2 like this:
id | dates | customer | sale
With this table definition:
CREATE TABLE sales (
id int auto_increment primary key,
dates date,
customer int,
sale int
);
sales1 and sales2 have the same definition, but sales2 has sale = -1 in every row. A customer can be in neither, one, or both tables. Both tables have around 300,000 records and many more fields than indicated here (around 50 fields). They are InnoDB.
I want to select, for each customer:
number of purchases
last purchase value
total amount of purchases, when it has a positive value
The query I am using is:
SELECT a.customer, count(a.sale), max_sale
FROM sales a
INNER JOIN (SELECT customer, sale max_sale
            FROM sales x
            WHERE dates = (SELECT max(dates)
                           FROM sales y
                           WHERE x.customer = y.customer
                             AND y.sale > 0)
           ) b
ON a.customer = b.customer
GROUP BY a.customer, max_sale;
The problem is:
I have to get the results, which I need for certain calculations, separated by dates: information on year 2012, information on year 2013, but also information from all the years together.
Whenever I do just one year, it takes about 2-3 minutes to store all the information.
But when I try to gather information from all the years, the database crashes and I get messages like:
InternalError: (InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction')
It seems that joining such huge tables is too much for the database. When I EXPLAIN the query, almost all of the time comes from creating a tmp table.
I thought of splitting the data gathering into quarters: get the results for every three months and then join and sort them. But I guess this final join and sort will be too much for the database again.
So, what would you experts recommend to optimize these queries, given that I cannot change the table structure?
300k rows is not a huge table. We frequently see 300 million row tables.
The biggest problem with your query is that you're using a correlated subquery, so it has to re-execute the subquery for each row in the outer query.
It's often the case that you don't need to do all your work in one SQL statement. There are advantages to breaking it up into several simpler SQL statements:
Easier to code.
Easier to optimize.
Easier to debug.
Easier to read.
Easier to maintain if/when you have to implement new requirements.
Number of Purchases
SELECT customer, COUNT(sale) AS number_of_purchases
FROM sales
GROUP BY customer;
An index on sales(customer,sale) would be best for this query.
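In MySQL syntax that would be something like the following (the index name is illustrative, and the other indexes recommended below can be created the same way):
ALTER TABLE sales ADD INDEX idx_customer_sale (customer, sale);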
Last Purchase Value
This is the greatest-n-per-group problem that comes up frequently.
SELECT a.customer, a.sale as max_sale
FROM sales a
LEFT OUTER JOIN sales b
ON a.customer=b.customer AND a.dates < b.dates
WHERE b.customer IS NULL;
In other words, try to match row a to a hypothetical row b that has the same customer and a greater date. If no such row is found, then a must have the greatest date for that customer.
An index on sales(customer,dates,sale) would be best for this query.
If you might have more than one sale for a customer on that greatest date, this query will return more than one row per customer. You'd need to find another column to break the tie. If you use an auto-increment primary key, it's suitable as a tie breaker because it's guaranteed to be unique and it tends to increase chronologically.
SELECT a.customer, a.sale as max_sale
FROM sales a
LEFT OUTER JOIN sales b
ON a.customer=b.customer AND (a.dates < b.dates OR a.dates = b.dates and a.id < b.id)
WHERE b.customer IS NULL;
Total Amount of Purchases, When It Has a Positive Value
SELECT customer, SUM(sale) AS total_purchases
FROM sales
WHERE sale > 0
GROUP BY customer;
An index on sales(customer,sale) would be best for this query.
You should consider using NULL to signify a missing sale value instead of -1. Aggregate functions like SUM() and COUNT() ignore NULLs, so you don't have to use a WHERE clause to exclude rows with sale < 0.
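A tiny illustration of that NULL behavior, using throwaway literal values:
-- SUM and COUNT skip NULLs: this returns total = 500, n = 2
SELECT SUM(sale) AS total, COUNT(sale) AS n
FROM (SELECT 200 AS sale UNION ALL SELECT 300 UNION ALL SELECT NULL) t;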
Re: your comment
What I have now is a table with fields year, quarter, total_sale (relating to the pair (year, quarter)) and sale. What I want to gather is information regarding a certain period: this quarter, all quarters, year 2011... The info has to be split into top customers, the ones with bigger sales, etc. Would it be possible to get the last purchase value from customers with total_purchases bigger than 5?
Top Five Customers for Q4 2012
SELECT customer, SUM(sale) AS total_purchases
FROM sales
WHERE (year, quarter) = (2012, 4) AND sale > 0
GROUP BY customer
ORDER BY total_purchases DESC
LIMIT 5;
I'd want to test it against real data, but I believe an index on sales(year, quarter, customer, sale) would be best for this query.
Last Purchase for Customers with Total Purchases > 5
SELECT a.customer, a.sale as max_sale
FROM sales a
INNER JOIN sales c ON a.customer=c.customer
LEFT OUTER JOIN sales b
ON a.customer=b.customer AND (a.dates < b.dates OR a.dates = b.dates and a.id < b.id)
WHERE b.customer IS NULL
GROUP BY a.id
HAVING COUNT(*) > 5;
As in the other greatest-n-per-group query above, an index on sales(customer,dates,sale) would be best for this query. It probably can't optimize both the join and the group by, so this will incur a temporary table. But at least it will only do one temporary table instead of many.
These queries are complex enough. You shouldn't try to write a single SQL query that can give all of these results. Remember the classic quote from Brian Kernighan:
Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?
I think you should try adding an index on sales(customer, dates). The subquery is probably the performance bottleneck.
You can make this puppy scream. Dump the whole inner join query. Really. This is a trick virtually no one seems to know about.
Assuming dates is a datetime, convert it to a sortable string, concatenate the values you want, take MAX (or MIN), then substring and cast the result. You may need to adjust the date-conversion function (this one works in MS-SQL), but the idea will work anywhere:
SELECT customer, count(sale),
       max_sale = cast(substring(max(convert(char(19), dates, 120) + str(sale, 12, 2)), 20, 12) as numeric(12, 2))
FROM sales a
GROUP BY customer
Voilà. If you need more result columns, do:
SELECT yourkey
, maxval = left(val, N1) --you often won't need this
, result1 = substring(val, N1+1, N2)
, result2 = substring(val, N1+N2+1, N3) --etc. for more values
FROM ( SELECT yourkey, val = max(cast(maxval as char(N1))
+ cast(resultCol1 as char(N2))
+ cast(resultCol2 as char(N3)) )
FROM yourtable GROUP BY yourkey ) t
Be sure that you have fixed lengths for all but the last field. This takes a little work to get your head around, but is very learnable and repeatable. It will work on any database engine, and even if you have rank functions, this will often significantly outperform them.
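Since this question concerns MySQL, a hedged translation of the first query might look like this; it assumes sale is never negative (zero-padding breaks the ordering and the final cast for negative values, so it would need adjusting before it could cover the sales2 table with its sale = -1):
SELECT customer,
       COUNT(sale) AS sale_count,
       -- string-sort ISO date (chars 1-10) + zero-padded amount (chars 11-22),
       -- keep the MAX, then peel the amount back off and cast it
       CAST(SUBSTRING(MAX(CONCAT(DATE_FORMAT(dates, '%Y-%m-%d'),
                                 LPAD(sale, 12, '0'))), 11) AS SIGNED) AS max_sale
FROM sales
GROUP BY customer;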
I'm a MySQL query noobie so I'm sure this is a question with an obvious answer.
But, I was looking at these two queries. Will they return different result sets? I understand that the sorting process would commence differently, but I believe they will return the same results with the first query being slightly more efficient?
Query 1: HAVING, then AND
SELECT user_id
FROM forum_posts
GROUP BY user_id
HAVING COUNT(id) >= 100
AND user_id NOT IN (SELECT user_id FROM banned_users)
Query 2: WHERE, then HAVING
SELECT user_id
FROM forum_posts
WHERE user_id NOT IN(SELECT user_id FROM banned_users)
GROUP BY user_id
HAVING COUNT(id) >= 100
Actually, the first query will be less efficient (the HAVING filter is applied only after grouping, while WHERE is applied before it).
UPDATE
Some pseudo code to illustrate how your queries are executed ([very] simplified version).
First query:
1. SELECT user_id FROM forum_posts
2. SELECT user_id FROM banned_user
3. Group, count, etc.
4. Exclude records from the first result set if they are presented in the second
Second query
1. SELECT user_id FROM forum_posts
2. SELECT user_id FROM banned_user
3. Exclude records from the first result set if they are presented in the second
4. Group, count, etc.
The order of steps 1 and 2 is not important; MySQL can choose whichever it thinks is better. The important difference is in steps 3 and 4: HAVING is applied after GROUP BY. Grouping is usually more expensive than joining (excluding records can be considered a join operation in this case), so the fewer records it has to group, the better the performance.
You already have answers showing that the two queries return the same results, and various opinions on which one is more efficient.
My opinion is that there will be a difference in efficiency (speed) only if the optimizer produces different plans for the two queries. I think that for the latest MySQL versions the optimizers are smart enough to find the same plan for either query, so there will be no difference at all, but of course one can test and check, either inspecting the execution plans with EXPLAIN or running the two queries against some test tables.
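For example, to inspect the plan of the second query, just prefix it with EXPLAIN (and do the same with the first to compare):
EXPLAIN SELECT user_id
FROM forum_posts
WHERE user_id NOT IN (SELECT user_id FROM banned_users)
GROUP BY user_id
HAVING COUNT(id) >= 100;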
I would use the second version in any case, just to play safe.
Let me add that:
COUNT(*) is usually more efficient than COUNT(notNullableField) in MySQL. Until that is fixed in future MySQL versions, use COUNT(*) where applicable.
Therefore, you can also use:
SELECT user_id
FROM forum_posts
WHERE user_id NOT IN
( SELECT user_id FROM banned_users )
GROUP BY user_id
HAVING COUNT(*) >= 100
There are also other ways to achieve the same (NOT IN equivalent) filtering before the GROUP BY is applied.
Using LEFT JOIN / NULL :
SELECT fp.user_id
FROM forum_posts AS fp
LEFT JOIN banned_users AS bu
ON bu.user_id = fp.user_id
WHERE bu.user_id IS NULL
GROUP BY fp.user_id
HAVING COUNT(*) >= 100
Using NOT EXISTS :
SELECT fp.user_id
FROM forum_posts AS fp
WHERE NOT EXISTS
( SELECT *
FROM banned_users AS bu
WHERE bu.user_id = fp.user_id
)
GROUP BY fp.user_id
HAVING COUNT(*) >= 100
Which of the 3 methods is faster depends on your table sizes and a lot of other factors, so best is to test with your data.
HAVING conditions are applied to the grouped results, and since you group by user_id, all of its possible values will be present in the grouped result, so the placement of the user_id condition is not important.
To me, the second query is more efficient because it lowers the number of records for GROUP BY and HAVING.
Alternatively, you may try the following query to avoid using IN:
SELECT `fp`.`user_id`
FROM `forum_posts` `fp`
LEFT JOIN `banned_users` `bu` ON `fp`.`user_id` = `bu`.`user_id`
WHERE `bu`.`user_id` IS NULL
GROUP BY `fp`.`user_id`
HAVING COUNT(`fp`.`id`) >= 100
Hope this helps.
No, it does not give the same results.
The first query filters records by the count(id) condition, while the other filters records first and then applies the HAVING clause.
The second query is correctly written.