Retrieve detail rows of a group based on grand total - mysql

I have a table that looks like this one :
+------+------+------------------+
| item | val  | timestamp        |
+------+------+------------------+
|    1 | 3.66 | 16-05-2011 09:17 |
|    1 | 2.56 | 16-05-2011 09:47 |
|    2 | 4.23 | 16-05-2011 09:37 |
|    3 | 6.89 | 16-05-2011 11:26 |
|    3 | 1.12 | 16-05-2011 12:11 |
|    3 | 4.56 | 16-05-2011 13:23 |
|    4 | 1.10 | 16-05-2011 14:11 |
|    4 | 9.79 | 16-05-2011 14:23 |
|    5 | 1.58 | 16-05-2011 15:27 |
|    5 | 0.80 | 16-05-2011 15:29 |
|    6 | 3.80 | 16-05-2011 15:29 |
+------+------+------------------+
So the grand total of all items for the day (16 May 2011) is 40.09.
Now I want to retrieve which items of this list make up 80% of the grand total.
For example:
Grand Total : 40.09
80% of the Grand Total : 32.07
Starting from the item with the largest share of the total, I want to retrieve the grouped list of items that make up 80% of the grand total:
+------+------+
| item | val |
+------+------+
| 3 | 12.57|
| 4 | 10.89|
| 1 | 6.22|
+------+------+
As you can see, the result set contains the rows grouped by item code and ordered by their share of the grand total, descending, until the 80% threshold is reached.
From item 2 onward the items are discarded from the result set because including them would exceed the 80% threshold:
12.57 + 10.89 + 6.22 + 4.23 > 32.07 (80% of the grand total)
This is not homework; it is a real case where I am stuck, and I need to achieve the result with a single query.
The query should run unmodified, or with few changes, on MySQL, SQL Server and PostgreSQL.

You can do this with a single query:
WITH Total_Sum(overallTotal) AS (
    SELECT SUM(val)
    FROM dataTable),
Summed_Items(id, total) AS (
    SELECT id, SUM(val)
    FROM dataTable
    GROUP BY id),
Ordered_Sums(id, total, ord) AS (
    SELECT id, total,
           ROW_NUMBER() OVER(ORDER BY total DESC)
    FROM Summed_Items),
Percent_List(id, itemTotal, ord, overallTotal) AS (
    SELECT id, total, ord, total
    FROM Ordered_Sums
    WHERE ord = 1
    UNION ALL
    SELECT b.id, b.total, b.ord, b.total + a.overallTotal
    FROM Percent_List AS a
    JOIN Ordered_Sums AS b
      ON b.ord = a.ord + 1
    JOIN Total_Sum AS c
      ON (c.overallTotal * .8) > (a.overallTotal + b.total))
SELECT id, itemTotal
FROM Percent_List
Which will yield the following:
id itemTotal
3 12.57
4 10.89
1 6.22
Please note that this will not work in MySQL before 8.0 (no CTEs), and it requires a more recent version of PostgreSQL (8.4 or later; earlier versions do not support OLAP functions). On MySQL 8.0 and PostgreSQL the recursive CTE also has to be introduced with WITH RECURSIVE. SQL Server should be able to run the statement as-is (I think; this was written and tested on DB2). Otherwise, you may attempt to translate this into correlated table joins, etc., but it will not be pretty, if it's even possible (a stored procedure or re-assembly in a higher-level language may then be your only option).
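Where window functions are available, a running total computed with SUM() OVER is a simpler alternative to the recursive CTE. Note that this is a different technique from the one in the answer above. The sketch below is only an illustration under assumptions: it runs the query through Python's sqlite3 module (SQLite 3.25 or later) as a stand-in engine, loads the question's sample rows, and assumes a table named dataTable with columns item, val and ts.

```python
import sqlite3

# Sample rows from the question: (item, val, timestamp as text).
ROWS = [
    (1, 3.66, "16-05-2011 09:17"), (1, 2.56, "16-05-2011 09:47"),
    (2, 4.23, "16-05-2011 09:37"), (3, 6.89, "16-05-2011 11:26"),
    (3, 1.12, "16-05-2011 12:11"), (3, 4.56, "16-05-2011 13:23"),
    (4, 1.10, "16-05-2011 14:11"), (4, 9.79, "16-05-2011 14:23"),
    (5, 1.58, "16-05-2011 15:27"), (5, 0.80, "16-05-2011 15:29"),
    (6, 3.80, "16-05-2011 15:29"),
]

# Running-total variant (an assumption, not the answer's recursive CTE).
QUERY = """
WITH summed AS (
    SELECT item, SUM(val) AS total
    FROM dataTable
    GROUP BY item
),
running AS (
    SELECT item,
           total,
           -- cumulative sum over items, largest total first
           SUM(total) OVER (ORDER BY total DESC
                            ROWS BETWEEN UNBOUNDED PRECEDING
                                     AND CURRENT ROW) AS running_total,
           -- grand total repeated on every row
           SUM(total) OVER () AS grand_total
    FROM summed
)
SELECT item, ROUND(total, 2)
FROM running
WHERE running_total < 0.8 * grand_total
ORDER BY total DESC
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataTable (item INTEGER, val REAL, ts TEXT)")
conn.executemany("INSERT INTO dataTable VALUES (?, ?, ?)", ROWS)
top_items = conn.execute(QUERY).fetchall()
conn.close()

print(top_items)
```

The strict comparison mirrors the condition in the recursive CTE, so an item whose running total lands exactly on the 80% mark is left out.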

I don't know of any way this can be done with a single query; you'll probably have to create a stored procedure. The steps of the proc would be something like this:
Calculate the grand total for that day by using a SUM
Get the individual records for that day ordered by val DESC
Keep a running total as you loop through the individual records; as long as the running total is < 0.8 * grandtotal, add the current record to your list
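The steps above can be sketched in a higher-level language as well. The following is a minimal Python illustration, not a stored procedure: the function name items_within_threshold and the (item, val) tuple layout are assumptions, the loop works on per-item totals (as the question's expected output does) rather than on individual records, and it checks the threshold before adding an item so that the item crossing 80% is excluded.

```python
from collections import defaultdict

# Sample (item, val) pairs from the question.
ROWS = [
    (1, 3.66), (1, 2.56), (2, 4.23), (3, 6.89), (3, 1.12), (3, 4.56),
    (4, 1.10), (4, 9.79), (5, 1.58), (5, 0.80), (6, 3.80),
]


def items_within_threshold(rows, fraction=0.8):
    """Return (item, total) pairs, largest first, whose running sum
    stays below fraction * grand total."""
    # Step 1: per-item totals and the grand total.
    totals = defaultdict(float)
    for item, val in rows:
        totals[item] += val
    limit = fraction * sum(totals.values())

    # Step 2: order the totals from largest to smallest.
    ordered = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

    # Step 3: keep a running total and stop before crossing the limit.
    selected, running = [], 0.0
    for item, total in ordered:
        if running + total >= limit:
            break
        running += total
        selected.append((item, round(total, 2)))
    return selected


selected_items = items_within_threshold(ROWS)
print(selected_items)
```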

Related

Can I get the full rows when using group by multiple columns?

If the date, item, and category are the same in the table,
I'd like to treat those rows as the same row and return the rows of the first n such groups (e.g. if n is 3, the equivalent of LIMIT 0, 3 applied to the groups).
------------------------------------------
id | date | item | category | ...
------------------------------------------
101 | 20220201| pencil | stationery | ... <---
------------------------------------------ | treat as same result
105 | 20220201| pencil | stationery | ... <---
------------------------------------------
120 | 20220214| desk | furniture | ...
------------------------------------------
125 | 20220219| tongs | utensil | ... <---
------------------------------------------ | treat as same
129 | 20220219| tongs | utensil | ... <---
------------------------------------------
130 | 20220222| tongs | utensil | ...
expected results (if n is 3)
-----------------------------------------------
id | date | item | category | ... rank
-----------------------------------------------
101 | 20220201| pencil | stationery | ... 1
-----------------------------------------------
105 | 20220201| pencil | stationery | ... 1
-----------------------------------------------
120 | 20220214| desk | furniture | ... 2
-----------------------------------------------
125 | 20220219| tongs | utensil | ... 3
-----------------------------------------------
129 | 20220219| tongs | utensil | ... 3
The problem is that I also have to return the rows of each group.
If I had only one column to group by, I could compare the id value with the original table, but I don't know what to do with multiple columns.
Is there any way to solve this problem?
For reference, I tried user variables to compare each row with the previous values,
but I couldn't use that approach because the query was too slow.
SELECT
*,
IF(@prev_date=date and @prev_item=item and @prev_category=category, @ranking, @ranking:=@ranking+1) AS sameRow,
@prev_item:=item,
@prev_date:= date,
@prev_category:=category,
@ranking
FROM ( SELECT ...
I'm using MySQL 8.0, and the id values are not contiguous because I have to ORDER BY before GROUP BY.
If I understand correctly, you can use the dense_rank window function and order by the columns that define a group.
If the date column determines the ordering, I would put it first.
SELECT *
FROM (
SELECT *,dense_rank() OVER(ORDER BY date, item, category) rnk
FROM T
) t1
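The dense_rank query above ranks every group but does not yet cut the result off at n groups. A minimal sketch of that last step follows, assuming the filter is simply rnk <= n in the outer query. It uses Python's sqlite3 module (SQLite 3.25 or later) as a stand-in engine with the question's sample rows; the table name T comes from the answer, while the WHERE clause and the final ORDER BY are additions for illustration.

```python
import sqlite3

# Sample rows from the question: (id, date, item, category).
ROWS = [
    (101, "20220201", "pencil", "stationery"),
    (105, "20220201", "pencil", "stationery"),
    (120, "20220214", "desk", "furniture"),
    (125, "20220219", "tongs", "utensil"),
    (129, "20220219", "tongs", "utensil"),
    (130, "20220222", "tongs", "utensil"),
]

N = 3  # number of groups to keep

QUERY = """
SELECT id, date, item, category, rnk
FROM (
    SELECT *, DENSE_RANK() OVER (ORDER BY date, item, category) AS rnk
    FROM T
) t1
WHERE rnk <= ?      -- keep only the first N groups (added for illustration)
ORDER BY rnk, id
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (id INTEGER, date TEXT, item TEXT, category TEXT)")
conn.executemany("INSERT INTO T VALUES (?, ?, ?, ?)", ROWS)
first_groups = conn.execute(QUERY, (N,)).fetchall()
conn.close()

for row in first_groups:
    print(row)
```

Rows that share date, item and category receive the same rank, so the filter keeps whole groups rather than a fixed number of rows.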
Window functions come in very handy in this situation. But for those of us still using MySQL 5.7, where functions such as row_number don't exist, we have to either resort to using a user variable and resetting the value every time before the main statement, or defining the user variable directly in the statement.
method 1
set @row_id=0; -- remember to reset @row_id to 0 every time before the main query below
select id,date,item,category,rank from testtb join
(
select date,item,category, (@row_id:=@row_id+1) as rank
from
(select date,item,category from testtb group by date,item,category) t1
) t2
using(date,item,category);
method 2
select id,date,item,category,rank from testtb join
(
select date,item,category, (@row_id:=@row_id+1) as rank
from
(select date,item,category from testtb group by date,item,category) t1, (select @row_id := 0) as n
) t2
using(date,item,category);

How do I get results of a MySQL JOIN where records meet a value criteria in joined table?

This may be simple but I can't figure it out...
I have two tables:
tbl_results:
runID | balance |
1 | 3432
2 | 5348
3 | 384
tbl_phases:
runID_fk | pc |
1 | 34
1 | 2
1 | 18
2 | 15
2 | 18
2 | 20
3 | -20
3 | 10
3 | 60
I want to get a recordset of: runID, balance, min(pc), max(pc) only where pc>10 and pc<50 for each runID as a group, excluding runIDs where any associated pc value is outside of value range.
I would want the following results from what's described above:
runID | balance | min_pc | max_pc
2 | 5348 | 15 | 20
... because runID=1&3 have pc values that fall outside the numeric range for pc noted above.
Thanks in advance!
You may apply filters based on your requirements in your having clause. You may try the following.
Query #1
SELECT
r.runID,
MAX(r.balance) as balance,
MIN(p.pc) as min_pc,
MAX(p.pc) as max_pc
FROM
tbl_results r
INNER JOIN
tbl_phases p ON p.runID_fk = r.runID
GROUP BY
r.runID
HAVING
MIN(p.pc)>10 AND MAX(p.pc) < 50;
runID | balance | min_pc | max_pc
2 | 5348 | 15 | 20
Query #2
SELECT
r.runID,
MAX(r.balance) as balance,
MIN(p.pc) as min_pc,
MAX(p.pc) as max_pc
FROM
tbl_results r
INNER JOIN
tbl_phases p ON p.runID_fk = r.runID
GROUP BY
r.runID
HAVING
COUNT(CASE WHEN p.pc <= 10 or p.pc >= 50 THEN 1 END) =0;
runID | balance | min_pc | max_pc
2 | 5348 | 15 | 20
Updated with comments from Rahul Biswas
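Both HAVING variants can be checked against the question's sample data. The sketch below does that with Python's sqlite3 module as a stand-in for MySQL (an assumption made only so the example is self-contained); the table and column names are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl_results (runID INTEGER, balance INTEGER)")
conn.execute("CREATE TABLE tbl_phases (runID_fk INTEGER, pc INTEGER)")
conn.executemany("INSERT INTO tbl_results VALUES (?, ?)",
                 [(1, 3432), (2, 5348), (3, 384)])
conn.executemany("INSERT INTO tbl_phases VALUES (?, ?)",
                 [(1, 34), (1, 2), (1, 18),
                  (2, 15), (2, 18), (2, 20),
                  (3, -20), (3, 10), (3, 60)])

BASE = """
SELECT r.runID,
       MAX(r.balance) AS balance,
       MIN(p.pc) AS min_pc,
       MAX(p.pc) AS max_pc
FROM tbl_results r
INNER JOIN tbl_phases p ON p.runID_fk = r.runID
GROUP BY r.runID
HAVING {condition}
"""

# Query #1: the whole group is in range if its extremes are in range.
by_extremes = conn.execute(
    BASE.format(condition="MIN(p.pc) > 10 AND MAX(p.pc) < 50")).fetchall()

# Query #2: count the out-of-range rows per group and require zero.
by_count = conn.execute(
    BASE.format(
        condition="COUNT(CASE WHEN p.pc <= 10 OR p.pc >= 50 THEN 1 END) = 0"
    )).fetchall()
conn.close()

print(by_extremes)
print(by_count)
```

The two conditions are equivalent here because a group has no value outside the open range (10, 50) exactly when its minimum and maximum are both inside it.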

How to get the difference between consecutive rows in MySQL?

I have a table in a MySQL database with this data:
id date number qty
114 07-10-2018 200 5
120 01-12-2018 300 10
123 03-02-2019 700 12
1126 07-03-2019 1000 15
I want to calculate the difference between consecutive rows, and I need the output to look like this:
id date number diff qty avg
114 07-10-2018 200 0 5 0
120 01-12-2018 300 100 10 10
123 03-02-2019 700 400 12 33.33
1126 07-03-2019 1000 300 15 20
Does anyone know how to do this in a MySQL query? I want the first value of the diff and avg columns to be 0 and the rest to be the difference.
For MySQL 8, use the LAG window function.
SELECT
test.id,
test.date,
test.number,
test.qty,
IFNULL(test.number - LAG(test.number) OVER w, 0) AS diff,
ROUND(IFNULL(test.number - LAG(test.number) OVER w, 0)/ test.qty, 2) AS 'Avg'
FROM purchases test
WINDOW w AS (ORDER BY test.`date` ASC);
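To see what the LAG query produces on the question's data, here is a minimal sketch that runs a close port of it through Python's sqlite3 module (SQLite 3.25 or later). The port makes a few assumptions: dates are stored as ISO text so that they sort correctly, the average column is named avg_per_qty, and the division is multiplied by 1.0 because SQLite, unlike MySQL, performs integer division on integer operands.

```python
import sqlite3

# Sample rows from the question, with dates rewritten as YYYY-MM-DD.
ROWS = [
    (114, "2018-10-07", 200, 5),
    (120, "2018-12-01", 300, 10),
    (123, "2019-02-03", 700, 12),
    (1126, "2019-03-07", 1000, 15),
]

QUERY = """
SELECT id, date, number, qty,
       -- LAG returns NULL on the first row, so fall back to 0
       IFNULL(number - LAG(number) OVER w, 0) AS diff,
       ROUND(IFNULL(number - LAG(number) OVER w, 0) * 1.0 / qty, 2)
           AS avg_per_qty
FROM purchases
WINDOW w AS (ORDER BY date ASC)
ORDER BY date ASC
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (id INTEGER, date TEXT, number INTEGER, qty INTEGER)")
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?, ?)", ROWS)
lag_rows = conn.execute(QUERY).fetchall()
conn.close()

for row in lag_rows:
    print(row)
```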
For MySQL 5.7 or earlier
We can use a MySQL user variable to do this job. Assume your table is named test.
SELECT
test.id,
test.date,
test.number,
test.qty,
@diff:= IF(@prev_number = 0, 0, test.number - @prev_number) AS diff,
ROUND(@diff / qty, 2) 'avg',
@prev_number:= test.number as dummy
FROM
test,
(SELECT @prev_number:= 0 AS num) AS b
ORDER BY test.`date` ASC;
-------------------------------------------------------------------------------
Output:
| id | date | number| qty | diff | avg | dummy |
-----------------------------------------------------------------
| 114 | 2018-10-07 | 200 | 5 | 0 | 0.00 | 200 |
| 120 | 2018-12-01 | 300 | 10 | 100 | 10.00 | 300 |
| 123 | 2019-02-03 | 700 | 12 | 400 | 33.33 | 700 |
| 1126 | 2019-03-07 | 1000 | 15 | 300 | 20.00 | 1000 |
Explanation:
(SELECT @prev_number:= 0 AS num) AS b
We initialize the variable @prev_number to zero in the FROM clause and join it with each row of the test table.
@diff:= IF(@prev_number = 0, 0, test.number - @prev_number) AS diff First we compute the difference and store it in another variable, @diff, so it can be reused for the average calculation. The condition makes the diff for the first row zero.
@prev_number:= test.number as dummy We set the variable to the current number so that it can be used by the next row.
Note: We have to use this variable first, in both the difference and the average, and only then set it to the new value, so that the next row can access the value from the previous row.
You can skip or modify the ORDER BY clause as per your requirements.
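The "use the previous value first, then overwrite it" ordering described above is the same pattern as carrying a variable through a loop. A minimal Python sketch of that idea follows; the function name consecutive_diffs and the tuple layout are assumptions, and it uses None rather than 0 as the "no previous row" marker, which avoids misfiring if a real number happens to be 0.

```python
# Sample rows from the question: (id, date, number, qty), already in date order.
ROWS = [
    (114, "2018-10-07", 200, 5),
    (120, "2018-12-01", 300, 10),
    (123, "2019-02-03", 700, 12),
    (1126, "2019-03-07", 1000, 15),
]


def consecutive_diffs(rows):
    """Return (id, date, number, diff, qty, avg) for each row."""
    prev_number = None  # plays the role of @prev_number
    result = []
    for row_id, date, number, qty in rows:
        # Read the previous value first ...
        diff = 0 if prev_number is None else number - prev_number
        avg = round(diff / qty, 2)
        result.append((row_id, date, number, diff, qty, avg))
        # ... and only then overwrite it for the next row.
        prev_number = number
    return result


diff_rows = consecutive_diffs(ROWS)
for row in diff_rows:
    print(row)
```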
There could be better ways to do this, but try this:
SELECT A.id,
A.date,
A.number,
A.qty,
A.diff,
B.avg
FROM
(SELECT *, abs(LAG(number, 1, number) OVER (ORDER BY id) - number) AS 'diff'
FROM table) AS A
JOIN
(SELECT *, abs(LAG(number, 1, number) OVER (ORDER BY id) - number)/qty AS 'avg' FROM table) AS B
ON A.id = B.id;

Is this select subquery avoidable?

I have two tables (Invoices and taxes) in mysql:
Invoices:
- id
- account_id
- issued_at
- total
- gross_amount
- country
Taxes:
- id
- invoice_id
- tax_name
- tax_rate
- taxable_amount
- tax_amount
I'm trying to retrieve a report like this:
rep_month | country | total_amount | tax_name | tax_rate(%) | taxable_amount | tax_amount
--------------------------------------------------------------------------------------
2017-01-01 | ES | 1000 | TAX1 | 21 | 700 | 147
2017-01-01 | ES | 1000 | TAX2 | -15 | 700 | 105
2016-12-01 | FR | 100 | TAX4 | 20 | 30 | 6
2016-12-01 | FR | 100 | B2B | 0 | 70 | 0
2017-01-01 | GB | 2500 | TAX3 | 20 | 1000 | 200
The idea behind this is that an invoice has a has_many relation with taxes, so an invoice may or may not have taxes. The report should show the total amount collected (total_amount) for a given country (regardless of whether it includes taxes)
and indicate which part of that total amount is taxable (taxable_amount) for a specific tax.
My current approach is this one:
SELECT
DATE_FORMAT(invoices.issued_at, '%Y-%m-01') AS rep_month,
invoices.country AS country,
( SELECT sum(docs.gross_amount)
FROM invoices AS docs
WHERE docs.country = invoices.country
AND DATE_FORMAT(docs.issued_at, '%Y-%m-01') = rep_month
) AS total_amount,
taxes.tax_name AS tax_name,
taxes.tax_rate AS tax_rate,
SUM(taxes.taxable_amount) AS taxable_amount,
SUM(taxes.tax_amount) AS tax_amount
FROM invoices
JOIN taxes ON invoices.id = taxes.invoice_id
AND invoices.issued_at BETWEEN '2016-01-01' AND '2017-12-31'
GROUP BY account_id, rep_month, country, tax_name, tax_rate
ORDER BY country desc
Well, this works but for a real dataset (thousands of records) it's really slow as the select subquery for retrieving the total_amount is being run for each row of the report.
I cannot make a LEFT JOIN taxes with a direct SUM(gross_amount) as the GROUP BY groups by tax name and rate and I need to show the total collected per country regardless if the amount was taxed or not. Is there a faster alternative to this?
I do not know the exact use case for this query, but the issue is that you're trying to compute the entire report in one go.
Ideally, you should run the query you have once, store its result in a separate table (a summary table), and then query the summary table directly whenever you need the report. When new entries arrive in the Invoices table, you can either rerun the query for each new entry or refresh the summary table periodically via a cron job.
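A minimal sketch of the summary-table idea follows, run through Python's sqlite3 module as a stand-in for MySQL. Everything specific in it is an assumption: the table name tax_report_summary, the made-up sample rows, the use of strftime in place of DATE_FORMAT, and grouping by month, country, tax name and rate only. It also swaps the per-row subquery for a join against totals aggregated once per month and country, which is not part of the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE invoices (
    id INTEGER PRIMARY KEY, account_id INTEGER, issued_at TEXT,
    total INTEGER, gross_amount INTEGER, country TEXT);
CREATE TABLE taxes (
    id INTEGER PRIMARY KEY, invoice_id INTEGER, tax_name TEXT,
    tax_rate INTEGER, taxable_amount INTEGER, tax_amount INTEGER);

-- Made-up sample data; invoice 4 has no taxes at all.
INSERT INTO invoices VALUES
    (1, 1, '2017-01-10', 600, 600, 'ES'),
    (2, 1, '2017-01-20', 400, 400, 'ES'),
    (3, 2, '2016-12-05', 100, 100, 'FR'),
    (4, 2, '2017-01-25',  50,  50, 'ES');
INSERT INTO taxes VALUES
    (1, 1, 'TAX1', 21, 500, 105),
    (2, 2, 'TAX1', 21, 200,  42),
    (3, 3, 'TAX4', 20,  30,   6);

-- Build the summary once; reports then read this table.
CREATE TABLE tax_report_summary AS
SELECT t.rep_month AS rep_month, t.country AS country,
       m.total_amount AS total_amount,
       t.tax_name AS tax_name, t.tax_rate AS tax_rate,
       t.taxable_amount AS taxable_amount, t.tax_amount AS tax_amount
FROM (
    -- per month, country and tax: taxable and tax amounts
    SELECT strftime('%Y-%m-01', i.issued_at) AS rep_month,
           i.country AS country,
           x.tax_name AS tax_name, x.tax_rate AS tax_rate,
           SUM(x.taxable_amount) AS taxable_amount,
           SUM(x.tax_amount) AS tax_amount
    FROM invoices i
    JOIN taxes x ON x.invoice_id = i.id
    GROUP BY strftime('%Y-%m-01', i.issued_at), i.country,
             x.tax_name, x.tax_rate
) t
JOIN (
    -- per month and country: total collected, taxed or not
    SELECT strftime('%Y-%m-01', issued_at) AS rep_month,
           country AS country,
           SUM(gross_amount) AS total_amount
    FROM invoices
    GROUP BY strftime('%Y-%m-01', issued_at), country
) m ON m.rep_month = t.rep_month AND m.country = t.country;
""")

report = conn.execute(
    "SELECT * FROM tax_report_summary ORDER BY country, tax_name").fetchall()
conn.close()

for row in report:
    print(row)
```

In this sample the ES total of 1050 includes the untaxed invoice, which is the behaviour the question asks for.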

Retrieve distinct values without reducing number of results

I'm writing a MySQL query to retrieve data from a list of questions.
The table looks like this:
-----------------------------------------------------
| id | answer_name | rating | question_id | answers |
-----------------------------------------------------
Several rows can have the same answer_name value, since several questions can be asked about the same answer.
To retrieve the data I use a LIMIT clause that is calculated from the ratings and the total number of rows.
For example, if I want the rows between 80% and 100% of the rating range and there are 100 rows, I would use ORDER BY rating LIMIT 80, 20.
My problem is the following: I need to retrieve rows with distinct values in the answer_name column, but a GROUP BY clause reduces the number of rows in the result through aggregation. As a result, the top percentage ranges return nothing, because the computed LIMIT offset points past the end of the reduced result.
Does anyone know of a way to keep the number of results the same and still retrieve distinct values for the answer_name column?
EDIT :
Here are some sample rows and expected output :
game_data table :
-----------------------------------------------------
| id | answer_name | rating | question_id | answers |
|----|-------------|--------|-------------|---------|
| 1 | A. Merkel | 40 | 1 | [1,2,3] |
| 2 | A. Merkel | 45 | 2 | [2,3,4] |
| 3 | B. Clinton | 55 | 1 | [2,5,8] |
| 4 | B. Clinton | 50 | 2 | [3,5,8] |
| 5 | L. Messi | 17 | 4 | [7,8,9] |
| 6 | L. Messi | 18 | 5 | [7,8,9] |
| 7 | L. Messi | 25 | 6 | [7,8,9] |
| 8 | D. Beckham | 21 | 4 | [6,7,8] |
| 9 | D. Beckham | 52 | 5 | [6,7,8] |
| 10 | D. Beckham | 41 | 6 | [6,7,8] |
-----------------------------------------------------
Where answers is an array of ids referring to another table.
Let's say I want to retrieve the 50% to 80% range of the table, ordered by rating.
SELECT id FROM game_data GROUP BY answer_name ORDER BY rating LIMIT 5, 3
The problem here is that GROUP BY answer_name reduces the number of rows, so instead of returning 3 results the query returns an empty set.
Also, I want the row selected for each group by the GROUP BY clause to be chosen at random.
Using group by like this goes against pretty much every instinct, but you said you want random values, so it's good enough.
select * from (
    select q.*, @rank := @rank + 1 as rank
    from (
        select * from game_data
        group by answer_name
        order by rating desc
    ) q, (select @rank := 0) qq
) qqq
where rank between (@rank * .5) and (@rank * .8)
How does it work? First (in the innermost query) we group by answer_name to get your distinct results, and we order them by rating as required.
Then, in the query wrapping that one, we give those results a ranking from 1 to however many rows are in the result. Once this level of the query completes, we know our best answer is answer 1, and our 'worst' answer is the last value of our @rank variable.
Then we get to the outermost query. We can use that @rank variable to determine our percentages, which we use to filter in the WHERE clause.
In all likelihood this will give you the same results each time you run the same query, but the values chosen are indeterminate, so they could change. If you want truly random (i.e. changes with each execution), that's a different kettle of fish altogether.
(Note: this bit: , (select @rank := 0) qq is purely to initialise the variable.)
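The rank-then-filter idea can also be written with window functions instead of a user variable. The sketch below is an illustration under assumptions, not the answer's query: it runs on Python's sqlite3 module (SQLite 3.25 or later) as a stand-in engine, and it picks the highest rating per answer_name so that the output is deterministic, whereas the answer's GROUP BY picks an indeterminate row per name.

```python
import sqlite3

# Sample rows from the question: (id, answer_name, rating, question_id, answers).
ROWS = [
    (1, "A. Merkel", 40, 1, "[1,2,3]"), (2, "A. Merkel", 45, 2, "[2,3,4]"),
    (3, "B. Clinton", 55, 1, "[2,5,8]"), (4, "B. Clinton", 50, 2, "[3,5,8]"),
    (5, "L. Messi", 17, 4, "[7,8,9]"), (6, "L. Messi", 18, 5, "[7,8,9]"),
    (7, "L. Messi", 25, 6, "[7,8,9]"), (8, "D. Beckham", 21, 4, "[6,7,8]"),
    (9, "D. Beckham", 52, 5, "[6,7,8]"), (10, "D. Beckham", 41, 6, "[6,7,8]"),
]

QUERY = """
WITH picked AS (
    -- one row per answer_name (assumption: keep its highest rating)
    SELECT answer_name, MAX(rating) AS rating
    FROM game_data
    GROUP BY answer_name
),
ranked AS (
    SELECT answer_name, rating,
           ROW_NUMBER() OVER (ORDER BY rating DESC) AS rnk,
           COUNT(*) OVER () AS total   -- number of distinct names
    FROM picked
)
SELECT answer_name, rating, rnk
FROM ranked
WHERE rnk BETWEEN total * 0.5 AND total * 0.8
ORDER BY rnk
"""

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE game_data (
    id INTEGER, answer_name TEXT, rating INTEGER,
    question_id INTEGER, answers TEXT)""")
conn.executemany("INSERT INTO game_data VALUES (?, ?, ?, ?, ?)", ROWS)
band = conn.execute(QUERY).fetchall()
conn.close()

print(band)
```

Because the percentages are applied to the number of distinct names rather than to the raw row count, the band cannot fall past the end of the grouped result.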
It's simple.
Use GROUP BY id instead of answer_name, because GROUP BY collapses duplicate values of the grouped column:
SELECT * FROM game_data GROUP BY id ORDER BY rating