Distinguish between NULLs when using "group by ... with rollup" - mysql

When I run a query using group by ... with rollup:
select a, b, sum(c)
from <table>
group by a, b with rollup;
I get duplicate rows in (what I consider to be) the PK of the query (that is, the group-by columns):
+------+------+--------+
| a | b | sum(c) |
+------+------+--------+
| NULL | NULL | 13 |
| NULL | 1 | 4 |
| NULL | 3 | 8 |
| NULL | 4 | 9 |
| NULL | NULL | 34 |
| 1 | 3 | 17 |
| 1 | 4 | NULL |
| 1 | 17 | 2 |
| 1 | NULL | 19 |
| 2 | NULL | 6 |
| 2 | 1 | 17 |
| 2 | 3 | 17 |
| 2 | NULL | 40 |
| 4 | 17 | 2 |
| 4 | NULL | 2 |
| 5 | NULL | 11 |
| 5 | 6 | 7 |
| 5 | NULL | 18 |
| 13 | 4 | 2 |
| 13 | NULL | 2 |
| 14 | 41 | 3 |
| 14 | NULL | 3 |
| 18 | 1 | 2 |
| 18 | NULL | 2 |
| 41 | 2 | 17 |
| 41 | NULL | 17 |
... more rows follow ...
How do I distinguish (NULL, NULL, 13) from (NULL, NULL, 34)? That is, how do I distinguish between the row that has nulls because of the underlying data, and the row that has nulls because it was added by rollup? (Note that there are more examples -- (2, NULL, 6) and (2, NULL, 40))

Good question. One option I can think of is to do this:
select COALESCE(a, -1) AS a, COALESCE(b, -1) AS b, sum(c)
from <table>
group by COALESCE(a, -1), COALESCE(b, -1) with rollup;

The answer from Cade Roux does not work for me (MySQL v5.1) and seems to behave inconsistently from version to version. A method proposed in the comments on the MySQL documentation page is the only reliable one I've seen:
http://dev.mysql.com/doc/refman/5.6/en/group-by-modifiers.html
Posted by Peter Kioko on June 27 2012 2:04pm
If you are grouping a column whose data contains NULLs then a NULL in the results is ambiguous as to whether it designates an actual data value or a rolled-up row.
In order to definitively know if a row is a rolled-up row or not you can use this trick:
SET @i = 0;
SELECT @i := @i + 1 AS row_num, year, country, product, SUM(profit) FROM sales GROUP BY year, country, product WITH ROLLUP;
In the result-set, any row whose row_num is the same value as the previous row's row_num is a rolled-up row and vice-versa.
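On MySQL 8.0.1 and later, the GROUPING() function offers another way to tell the two kinds of NULL apart: it returns 1 when the NULL in a column was produced by ROLLUP and 0 when it comes from the data. It is not available on the 5.1 server mentioned above. A minimal sketch, assuming a hypothetical table t with nullable integer columns a, b and c:

```sql
-- Hypothetical table standing in for <table>; requires MySQL 8.0.1 or later.
CREATE TABLE t (a INT, b INT, c INT);
INSERT INTO t (a, b, c) VALUES
  (NULL, NULL, 13),
  (NULL, 1, 4),
  (1, 3, 17),
  (1, NULL, 19);

SELECT a, b, SUM(c) AS total,
       GROUPING(a) AS a_rolled_up,  -- 1 when the NULL in a was added by ROLLUP
       GROUPING(b) AS b_rolled_up   -- 1 when the NULL in b was added by ROLLUP
FROM t
GROUP BY a, b WITH ROLLUP;
```

Rows where both flags are 0 are ordinary groups, even if a or b is NULL in the underlying data.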


Assigning passengers to buses based on bus capacity in MySQL

Problem: Buses and passengers arrive at a station. If a bus arrives at the station at time t_bus and a passenger arrives at time t_passenger, where t_passenger <= t_bus, then the passenger will attempt to use the first available bus whose capacity has not been exceeded. If, at the moment the bus arrives at the station, more passengers are waiting than its capacity allows, only capacity passengers will use the bus.
Write a SQL query to report the passengers that appear on each bus (if two passengers arrive at the same time, the passenger with the smaller passenger_id value should be given priority). The query result format is in the following example (schema and table descriptions appear at the end of this post).
Example
Input:
Buses table:
+--------+--------------+----------+
| bus_id | arrival_time | capacity |
+--------+--------------+----------+
| 1 | 2 | 1 |
| 2 | 4 | 10 |
| 3 | 7 | 2 |
+--------+--------------+----------+
Passengers table:
+--------------+--------------+
| passenger_id | arrival_time |
+--------------+--------------+
| 11 | 1 |
| 12 | 1 |
| 13 | 5 |
| 14 | 6 |
| 15 | 7 |
+--------------+--------------+
Output:
+--------+----------+-----------+------+--------------+-----------+
| bus_id | capacity | b_arrival | spot | passenger_id | p_arrival |
+--------+----------+-----------+------+--------------+-----------+
| 1 | 1 | 2 | 1 | 11 | 1 |
| 2 | 10 | 4 | 1 | 12 | 1 |
| 2 | 10 | 4 | 2 | NULL | NULL |
| 2 | 10 | 4 | 3 | NULL | NULL |
| 2 | 10 | 4 | 4 | NULL | NULL |
| 2 | 10 | 4 | 5 | NULL | NULL |
| 2 | 10 | 4 | 6 | NULL | NULL |
| 2 | 10 | 4 | 7 | NULL | NULL |
| 2 | 10 | 4 | 8 | NULL | NULL |
| 2 | 10 | 4 | 9 | NULL | NULL |
| 2 | 10 | 4 | 10 | NULL | NULL |
| 3 | 2 | 7 | 1 | 13 | 5 |
| 3 | 2 | 7 | 2 | 14 | 6 |
+--------+----------+-----------+------+--------------+-----------+
Explanation:
Passenger 11 arrives at time 1.
Passenger 12 arrives at time 1.
Bus 1 arrives at time 2 and collects passenger 11 as it has one empty seat.
Bus 2 arrives at time 4 and collects passenger 12 as it has ten empty seats.
Passenger 13 arrives at time 5.
Passenger 14 arrives at time 6.
Passenger 15 arrives at time 7.
Bus 3 arrives at time 7 and collects passengers 13 and 14 as it has two empty seats.
Attempt
The CTE
WITH RECURSIVE bus_spots AS (
SELECT B.bus_id, B.arrival_time AS b_arrival, B.capacity, 1 AS spot FROM Buses B
UNION ALL
SELECT BS.bus_id, BS.b_arrival, BS.capacity, BS.spot + 1 FROM bus_spots BS WHERE BS.spot < BS.capacity
) SELECT * FROM bus_spots ORDER BY bus_id, spot;
gives
+--------+-----------+----------+------+
| bus_id | b_arrival | capacity | spot |
+--------+-----------+----------+------+
| 1 | 2 | 1 | 1 |
| 2 | 4 | 10 | 1 |
| 2 | 4 | 10 | 2 |
| 2 | 4 | 10 | 3 |
| 2 | 4 | 10 | 4 |
| 2 | 4 | 10 | 5 |
| 2 | 4 | 10 | 6 |
| 2 | 4 | 10 | 7 |
| 2 | 4 | 10 | 8 |
| 2 | 4 | 10 | 9 |
| 2 | 4 | 10 | 10 |
| 3 | 7 | 2 | 1 |
| 3 | 7 | 2 | 2 |
+--------+-----------+----------+------+
as its result set while
WITH bus_queue AS (
SELECT
P.passenger_id,
P.arrival_time AS p_arrival,
ROW_NUMBER() OVER(ORDER BY P.arrival_time, P.passenger_id) AS queue_pos
FROM Passengers P
) SELECT * FROM bus_queue ORDER BY p_arrival, passenger_id;
gives
+--------------+-----------+-----------+
| passenger_id | p_arrival | queue_pos |
+--------------+-----------+-----------+
| 11 | 1 | 1 |
| 12 | 1 | 2 |
| 13 | 5 | 3 |
| 14 | 6 | 4 |
| 15 | 7 | 5 |
+--------------+-----------+-----------+
as its result set. But I'm not sure how to effectively relate the CTE result sets (or if this is even the best way of going about things), especially given the complications introduced by handling capacity effectively.
Question: Any ideas on how to work out a solution for this kind of problem (preferably without using variables)? For reference, I'm using MySQL 8.0.26.
Schema and Table Descriptions
Schema:
DROP TABLE IF EXISTS Buses;
CREATE TABLE IF NOT EXISTS
Buses (bus_id int, arrival_time int, capacity int);
INSERT INTO
Buses (bus_id, arrival_time, capacity)
VALUES
(1, 2, 1),
(2, 4, 10),
(3, 7, 2);
DROP TABLE IF EXISTS Passengers;
CREATE TABLE IF NOT EXISTS
Passengers (passenger_id int, arrival_time int);
INSERT INTO
Passengers (passenger_id, arrival_time)
VALUES
(11, 1),
(12, 1),
(13, 5),
(14, 6),
(15, 7);
Table descriptions:
Buses:
+--------------+------+
| Column Name | Type |
+--------------+------+
| bus_id | int |
| arrival_time | int |
| capacity | int |
+--------------+------+
bus_id is the primary key column for this table.
Each row of this table contains information about the arrival time of a bus at the station and its capacity (i.e., the number of empty seats it has).
There will be no two buses that arrive at the same time and capacity will be a positive integer.
Passengers:
+--------------+------+
| Column Name | Type |
+--------------+------+
| passenger_id | int |
| arrival_time | int |
+--------------+------+
passenger_id is the primary key column for this table.
Each row of this table contains information about the arrival time of a passenger at the station.
Using a recursive cte and several successive ctes:
with recursive cte(id, a, c, s) as (
select b.*, 1 from buses b
union all
select c.id, c.a, c.c, c.s + 1 from cte c where c.s+1 <= c.c
),
_passengers as (
select row_number() over (order by p.passenger_id) n, p.* from passengers p
),
gps(bid, n, a, pid) as (
select b.bus_id, p.n, p.arrival_time, p.passenger_id from buses b
join _passengers p on p.arrival_time <= b.arrival_time and not exists
(select 1 from buses b1 where b1.arrival_time < b.arrival_time and p.arrival_time <= b1.arrival_time)
),
slts(v, n, a, pid) as (
select case when
(select sum(g.bid = g1.bid and g1.n <= g.n) from gps g1) <= (select sum(c.id = g.bid) from cte c)
then g.bid else null end, g.n, g.a, g.pid from gps g
),
dists as (
select case when s.v is not null
then s.v
else (select min(b.bus_id) from buses b where b.arrival_time >= s.a and
(select sum(s2.v is null and s2.n <= s.n) from slts s2) <
(select sum(c3.id = b.bus_id) from cte c3)) end v,
s.a, s.pid from slts s
)
select c.id bus_id, c.c capacity, c.a arrival_time, c.s spot, p.pid passenger_id, p.a arrival from cte c
left join (select (select sum(d.v = d1.v and d1.a < d.a) from dists d1) + 1 r,
d.* from dists d where d.v is not null) p
on c.id = p.v and c.s = p.r
order by c.a, c.s
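An alternative sketch, not the approach used above: because passengers board in queue order, it is enough to track how many passengers have boarded in total after each bus. A passenger whose queue position falls between the running total before a bus and the running total after it rides that bus. The CTE names (bus_order, boarding, bus_queue) and column names (arrived, taken_before, taken_after) are made up for illustration, and the query lists only occupied spots, not the empty ones shown in the expected output.

```sql
WITH RECURSIVE
bus_order AS (
  -- Number the buses by arrival and count everyone who has arrived by then.
  SELECT b.bus_id, b.capacity,
         ROW_NUMBER() OVER (ORDER BY b.arrival_time) AS rn,
         (SELECT COUNT(*) FROM Passengers p
           WHERE p.arrival_time <= b.arrival_time) AS arrived
  FROM Buses b
),
boarding (rn, bus_id, taken_before, taken_after) AS (
  -- First bus: nobody has boarded yet.
  SELECT rn, bus_id, CAST(0 AS SIGNED),
         CAST(LEAST(capacity, arrived) AS SIGNED)
  FROM bus_order
  WHERE rn = 1
  UNION ALL
  -- Next bus: takes the waiting passengers, up to its capacity.
  SELECT o.rn, o.bus_id, g.taken_after,
         g.taken_after + LEAST(o.capacity, o.arrived - g.taken_after)
  FROM boarding g
  JOIN bus_order o ON o.rn = g.rn + 1
),
bus_queue AS (
  SELECT passenger_id, arrival_time AS p_arrival,
         CAST(ROW_NUMBER() OVER (ORDER BY arrival_time, passenger_id) AS SIGNED) AS queue_pos
  FROM Passengers
)
SELECT g.bus_id,
       q.queue_pos - g.taken_before AS spot,
       q.passenger_id,
       q.p_arrival
FROM boarding g
JOIN bus_queue q
  ON q.queue_pos > g.taken_before
 AND q.queue_pos <= g.taken_after
ORDER BY g.bus_id, spot;
```

For the sample data this should put passenger 11 on bus 1, passenger 12 on bus 2, and passengers 13 and 14 on bus 3. Left-joining the result onto the bus_spots CTE from the question would add the empty spots.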

How can I treat NULL as the minimum value?

I have a table like this:
// notifications
+----+-----------+-------+---------+---------+------+
| id | score | type | post_id | user_id | seen |
+----+-----------+-------+---------+---------+------+
| 1 | 15 | 1 | 2342 | 342 | 1 |
| 2 | 5 | 1 | 2342 | 342 | 1 |
| 3 | NULL | 2 | 5342 | 342 | 1 |
| 4 | -10 | 1 | 2342 | 342 | NULL |
| 5 | 5 | 1 | 2342 | 342 | NULL |
| 6 | NULL | 2 | 8342 | 342 | NULL |
| 7 | -2 | 1 | 2342 | 342 | NULL |
+----+-----------+-------+---------+---------+------+
-- type: 1 means "it is a vote", 2 means "it is a comment (without score)"
Here is my query:
SELECT SUM(score), type, post_id, seen
FROM notifications
WHERE user_id = 342
GROUP BY type, post_id
ORDER BY (seen IS NULL) desc
As you can see, the query uses SUM(), and both the type and post_id columns are in the GROUP BY clause. My question is about the seen column. I don't want to put it into the GROUP BY clause, so I have to use either MAX() or MIN() for it, right?
Actually, I need to select NULL for the seen column (in the query above) if even one row in the group has seen = NULL. My current query selects 1 as the value of seen, even when I use MIN(seen). So why is 1 the minimum when there is a NULL?
I also want to order the rows so that all rows with seen = NULL are at the top of the list. How can I do that?
Expected result:
// notifications
+-----------+-------+---------+------+
| score | type | post_id | seen |
+-----------+-------+---------+------+
| 13 | 1 | 2342 | NULL |
| NULL | 2 | 8342 | NULL |
| NULL | 2 | 5342 | 1 |
+-----------+-------+---------+------+
You could do this
case when sum(seen is null) > 0
then null
else min(seen)
end
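MIN() and the other aggregate functions skip NULL values, which is why MIN(seen) returns 1 in the question. The expression above can be dropped into the full query as in the following sketch. The table definition is an assumption reconstructed from the sample data, and the table alias n is added so that the seen column inside the aggregates cannot be confused with the seen output alias.

```sql
-- Assumed definition, reconstructed from the sample rows in the question.
CREATE TABLE notifications (
  id INT PRIMARY KEY, score INT, type INT,
  post_id INT, user_id INT, seen INT
);
INSERT INTO notifications (id, score, type, post_id, user_id, seen) VALUES
  (1, 15, 1, 2342, 342, 1),
  (2, 5, 1, 2342, 342, 1),
  (3, NULL, 2, 5342, 342, 1),
  (4, -10, 1, 2342, 342, NULL),
  (5, 5, 1, 2342, 342, NULL),
  (6, NULL, 2, 8342, 342, NULL),
  (7, -2, 1, 2342, 342, NULL);

SELECT SUM(n.score) AS score, n.type, n.post_id,
       -- NULL as soon as one row of the group is unseen, otherwise MIN(seen)
       CASE WHEN SUM(n.seen IS NULL) > 0 THEN NULL ELSE MIN(n.seen) END AS seen
FROM notifications AS n
WHERE n.user_id = 342
GROUP BY n.type, n.post_id
-- groups that contain an unseen row come first
ORDER BY SUM(n.seen IS NULL) > 0 DESC;
```

With the sample rows, seen should come out as NULL for posts 2342 and 8342 and as 1 for post 5342.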
You could use the following query:
SELECT SUM(score), type, post_id, min(IFNULL(seen, 0)) as seen
FROM notifications
WHERE user_id = 342
GROUP BY type, post_id
ORDER BY seen asc -- seen is now 0 for groups containing a NULL, so ascending order lists them first

How to display lines as dynamic columns with other values?

I want to display the order items that have collect_id = 2.
I also want to display all the fields related to each order_item as columns with their values.
These are the tables and the expected result:
+-------------------------------+
| order_item |
+-------------------------------+
| oi_id oi_price oi_collect_id |
| 1 100 2 |
| 2 30 2 |
| 3 55 3 |
| 4 70 4 |
| 5 220 2 |
| 6 300 4 |
+-------------------------------+
+-----------------------------------+
| field_value |
+-----------------------------------+
| v_value v_fk_field_id oi_fk_id |
| Peter 1 1 |
| Lagaf 2 1 |
| Football 3 1 |
| Male 4 1 |
| 12345678 5 1 |
| Frank 1 2 |
| Loran 2 2 |
| Tennis 3 2 |
| Male 4 2 |
| 11223658 5 2 |
| Nathali 1 5 |
| Waton 2 5 |
| Reading 3 5 |
+-----------------------------------+
oi_fk_id : foreign key ref(order_item.oi_id)
v_fk_field_id : foreign key ref(field.f_id)
+--------------------+
| field |
+--------------------+
| f_id f_label |
| 1 surname |
| 2 name |
| 3 hobbies |
| 4 sex |
| 5 phone |
+--------------------+
+-----------------------------------------------------------------------------+
| Result |
+-----------------------------------------------------------------------------+
| oi_id oi_price oi_collect_id surname name hobbies sex phone |
| 1 100 2 Peter Lagaf Football Male 12345678 |
| 2 30 2 Frank Loran Tennis Male 11223658 |
| 5 220 2 Nathali Waton Reading null null |
+-----------------------------------------------------------------------------+
Important: The field table does not contain only these 5 fields (name, surname, hobbies, sex, phone); it can contain many others that the developer may not know about, and the same goes for the corresponding values in the 'field_value' table.
PS: I didn't make the field labels columns of a table because they are dynamic and unlimited, and in the front-end application the user can add new fields as they want.
You can take advantage of dynamic pivoting to get the results:
SELECT GROUP_CONCAT(t.line)
FROM (
SELECT CONCAT('MAX(IF(t.l=''', f.f_label, ''',t.v,NULL)) AS ', f.f_label) AS line
FROM field f
) AS t
INTO @dynamic;
SELECT CONCAT('SELECT t.oi_id, t.oi_price, t.oi_collect_id,',
@dynamic,
' FROM ( SELECT oi.*, f.f_label AS l, fv.v_value AS v FROM order_item oi JOIN field_value fv ON fv.oi_fk_id = oi.oi_id JOIN field f ON f.f_id = fv.v_fk_field_id WHERE oi.oi_collect_id = 2 ) AS t GROUP BY t.oi_id;')
INTO @sql;
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
However it is limited by GROUP_CONCAT function:
The result is truncated to the maximum length that is given by the group_concat_max_len system variable, which has a default value of 1024. The value can be set higher, although the effective maximum length of the return value is constrained by the value of max_allowed_packet.
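If the generated column list is longer than the default limit, the session value can be raised before the column list is built. A minimal sketch; the value 100000 is an arbitrary illustration, not a recommendation:

```sql
-- Raise the GROUP_CONCAT() limit for the current session only.
SET SESSION group_concat_max_len = 100000;

-- Check the value now in effect.
SELECT @@SESSION.group_concat_max_len;
```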
EDIT - Why is the MAX function required?
To solve the given problem, we create a dynamic query based on the content of the field table.
For the given example data, the query without the MAX function would be:
SELECT t.oi_id,
t.oi_price,
t.oi_collect_id,
IF(t.l='surname', t.v, NULL) AS surname,
IF(t.l='name', t.v, NULL) AS name,
IF(t.l='hobbies', t.v, NULL) AS hobbies,
IF(t.l='sex', t.v, NULL) AS sex,
IF(t.l='phone', t.v, NULL) AS phone
FROM (
SELECT oi.*,
f.f_label AS l,
fv.v_value as V
FROM order_item oi
JOIN field_value fv
ON fv.oi_fk_id = oi.oi_id
JOIN field f
ON f.f_id = fv.v_fk_field_id
WHERE oi.oi_collect_id = 2
) AS t;
Which would result in:
+-------+----------+---------------+---------+-------+----------+------+----------+
| oi_id | oi_price | oi_collect_id | surname | name | hobbies | sex | phone |
+-------+----------+---------------+---------+-------+----------+------+----------+
| 1 | 100 | 2 | Peter | NULL | NULL | NULL | NULL |
| 2 | 30 | 2 | Frank | NULL | NULL | NULL | NULL |
| 5 | 220 | 2 | Nathali | NULL | NULL | NULL | NULL |
| 1 | 100 | 2 | NULL | Lagaf | NULL | NULL | NULL |
| 2 | 30 | 2 | NULL | Loran | NULL | NULL | NULL |
| 5 | 220 | 2 | NULL | Waton | NULL | NULL | NULL |
| 1 | 100 | 2 | NULL | NULL | Football | NULL | NULL |
| 2 | 30 | 2 | NULL | NULL | Tennis | NULL | NULL |
| 5 | 220 | 2 | NULL | NULL | Reading | NULL | NULL |
| 1 | 100 | 2 | NULL | NULL | NULL | Male | NULL |
| 2 | 30 | 2 | NULL | NULL | NULL | Male | NULL |
| 1 | 100 | 2 | NULL | NULL | NULL | NULL | 12345678 |
| 2 | 30 | 2 | NULL | NULL | NULL | NULL | 11223658 |
+-------+----------+---------------+---------+-------+----------+------+----------+
This is an intermediate result, where each row holds the value of one field and NULL for the others. The MAX function, together with a GROUP BY clause, combines the multiple rows for one order item into a single row by picking the non-NULL value in each column, since aggregate functions ignore NULL. It could be replaced by the MIN function, which also favors existing values over NULL.
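For comparison, this is roughly what the generated statement looks like once MAX and GROUP BY are added, written out by hand for the five example labels. It assumes the order_item, field_value and field tables from the question and, unlike the generated statement, lists every non-aggregated column in the GROUP BY clause:

```sql
SELECT t.oi_id,
       t.oi_price,
       t.oi_collect_id,
       -- each MAX() keeps the single non-NULL value per order item, if any
       MAX(IF(t.l = 'surname', t.v, NULL)) AS surname,
       MAX(IF(t.l = 'name',    t.v, NULL)) AS name,
       MAX(IF(t.l = 'hobbies', t.v, NULL)) AS hobbies,
       MAX(IF(t.l = 'sex',     t.v, NULL)) AS sex,
       MAX(IF(t.l = 'phone',   t.v, NULL)) AS phone
FROM (
    SELECT oi.oi_id, oi.oi_price, oi.oi_collect_id,
           f.f_label AS l,
           fv.v_value AS v
    FROM order_item oi
    JOIN field_value fv ON fv.oi_fk_id = oi.oi_id
    JOIN field f ON f.f_id = fv.v_fk_field_id
    WHERE oi.oi_collect_id = 2
) AS t
GROUP BY t.oi_id, t.oi_price, t.oi_collect_id;
```

An order item with no value for a label, such as item 5 for sex and phone, ends up with NULL in that column.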

Proper Indexing MySQL Table

I can't seem to get this query to perform any faster than 8 hours! 0_0
I have read up on indexing and I am still not sure I am doing this right.
I am expecting my query to calculate a value for BROK_1_RATING based on dates and other row values - 500,000 records.
Using record #1 as an example - my query should:
get all other records that have the same ESTIMID
ignore records where ANALYST =""
ignore records where ID is the same as record being compared i.e.
ID != 1
the records must fall within a time frame
i.e. BB.ANNDATS_CONVERTED <= working.ANNDATS_CONVERTED,
BB.REVDATS_CONVERTED > working.ANNDATS_CONVERTED
BB.IRECCD must = 1
Then count the result
Then write the count value to the BROK_1_RATING column for record #1
now do same for record#2, and #3 and so on for the entire table
In human terms - "Examine the date of record #1 - Now, within time frame from record #1 - count the number of times the number 1 exists with the same brokerage ESTIMID, do not count record #1, do not count blank ANALYST rows. Move on to record #2 and do the same"
UPDATE `working` SET `BROK_1_RATING` =
(SELECT COUNT(`ID`) FROM (SELECT `ID`, `IRECCD`, `ANALYST`, `ESTIMID`, `ANNDATS_CONVERTED`, `REVDATS_CONVERTED` FROM `working`) AS BB
WHERE
BB.`ANNDATS_CONVERTED` <= `working`.`ANNDATS_CONVERTED`
AND
BB.`REVDATS_CONVERTED` > `working`.`ANNDATS_CONVERTED`
AND
BB.`ID` != `working`.`ID`
AND
BB.`ESTIMID` = `working`.`ESTIMID`
AND
BB.`ANALYST` != ''
AND
BB.`IRECCD` = 1
)
WHERE `working`.`ANALYST` != '';
| ID | ANALYST | ESTIMID | IRECCD | ANNDATS_CONVERTED | REVDATS_CONVERTED | BROK_1_RATING | NO_TOP_RATING |
------------------------------------------------------------------------------------------------------------------
| 1 | DAVE | Brokerage000 | 4 | 1998-07-01 | 1998-07-04 | | 3 |
| 2 | DAVE | Brokerage000 | 1 | 1998-06-28 | 1998-07-10 | | 4 |
| 3 | DAVE | Brokerage000 | 5 | 1998-07-02 | 1998-07-08 | | 2 |
| 4 | DAVE | Brokerage000 | 1 | 1998-07-04 | 1998-12-04 | | 3 |
| 5 | SAM | Brokerage000 | 1 | 1998-06-14 | 1998-06-30 | | 4 |
| 6 | SAM | Brokerage000 | 1 | 1998-06-28 | 1999-08-08 | | 4 |
| 7 | | Brokerage000 | 1 | 1998-06-28 | 1999-08-08 | | 5 |
| 8 | DAVE | Brokerage111 | 2 | 1998-06-28 | 1999-08-08 | | 3 |
'EXPLAIN' results:
id| select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
----------------------------------------------------------------------------------------------------------------------------------------
1 | PRIMARY | working | index | ANALYST | PRIMARY | 4 | NULL | 467847 | Using where
2 | DEPENDENT SUBQUERY | <derived3> | ALL | NULL | NULL | NULL | NULL | 467847 | Using where
3 | DERIVED | working | index | NULL | test_combined_indexes | 226 | NULL | 467847 | Using index
I have indexes on the single columns - and as well - have tried multiple column index like this:
ALTER TABLE `working` ADD INDEX `test_combined_indexes` (`IRECCD`, `ID`, `ANALYST`, `ESTIMID`, `ANNDATS_CONVERTED`, `REVDATS_CONVERTED`) COMMENT '';
Well you can shorten the query a lot by just removing the extra stuff:
UPDATE `working` as AA SET `BROK_1_RATING` =
(SELECT COUNT(`ID`) FROM `working` AS BB
WHERE BB.`ANNDATS_CONVERTED` <= AA.`ANNDATS_CONVERTED`
AND BB.`REVDATS_CONVERTED` > AA.`ANNDATS_CONVERTED`
AND BB.`ID` != AA.`ID`
AND BB.`ESTIMID` = AA.`ESTIMID`
AND BB.`ANALYST` != ''
AND BB.`IRECCD` = 1 )
WHERE `ANALYST` != '';
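With the derived table gone, the correlated subquery can use an index on working directly. One option worth testing is a composite index that puts the equality columns (ESTIMID, IRECCD) first and the date columns after them. This is a sketch only: the index name is made up, and whether it helps should be confirmed with EXPLAIN on the real data.

```sql
-- Hypothetical index name; equality columns first, then the date columns.
ALTER TABLE `working`
  ADD INDEX `idx_estimid_ireccd_dates`
      (`ESTIMID`, `IRECCD`, `ANNDATS_CONVERTED`, `REVDATS_CONVERTED`);

-- Inspect the plan of the counting subquery for one row (ID = 1)
-- before running the full UPDATE.
EXPLAIN
SELECT COUNT(BB.`ID`)
FROM `working` AS AA
JOIN `working` AS BB
  ON BB.`ESTIMID` = AA.`ESTIMID`
 AND BB.`IRECCD` = 1
 AND BB.`ANNDATS_CONVERTED` <= AA.`ANNDATS_CONVERTED`
 AND BB.`REVDATS_CONVERTED` > AA.`ANNDATS_CONVERTED`
 AND BB.`ID` != AA.`ID`
 AND BB.`ANALYST` != ''
WHERE AA.`ID` = 1;
```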

SQL reduce number of columns in inner query

I have a query:
select
count(*), paymentOptionId
from
payments
where
id in (select min(reportDate), id
from payments
where userId in (select distinct userId
from payments
where paymentOptionId in (46,47,48,49,50,51,52,53,54,55,56))
group by userId)
group by
paymentOptionId;
The problem is the "select min(reportDate), id" part: this subquery must return a single column, but I can't figure out how to do that while still grouping to get the minimum.
The data set looks like
+----+--------+--------+-----------+---------------------+--------+----------+-----------------+
| id | userId | amount | userLevel | reportDate | buffId | bankQuot | paymentOptionId |
+----+--------+--------+-----------+---------------------+--------+----------+-----------------+
| 9 | 12012 | 5 | 5 | 2014-02-10 23:07:57 | NULL | NULL | 2 |
| 10 | 12191 | 5 | 6 | 2014-02-10 23:52:12 | NULL | NULL | 2 |
| 11 | 12295 | 5 | 6 | 2014-02-11 00:12:04 | NULL | NULL | 2 |
| 12 | 12295 | 5 | 6 | 2014-02-11 00:12:42 | NULL | NULL | 2 |
| 13 | 12256 | 5 | 6 | 2014-02-11 00:26:25 | NULL | NULL | 2 |
| 14 | 12256 | 5 | 6 | 2014-02-11 00:26:35 | NULL | NULL | 2 |
| 16 | 12510 | 5 | 5 | 2014-02-11 00:42:58 | NULL | NULL | 2 |
| 17 | 12510 | 5 | 5 | 2014-02-11 00:43:08 | NULL | NULL | 2 |
| 18 | 12510 | 18 | 5 | 2014-02-11 00:45:16 | NULL | NULL | 3 |
| 19 | 12510 | 5 | 6 | 2014-02-11 01:00:10 | NULL | NULL | 2 |
+----+--------+--------+-----------+---------------------+--------+----------+-----------------+
select count(*), paymentOptionId
from
(select userId, min(reportdate), paymentOptionId
from payments as t1
group by userId, paymentOptionId) as t2
group by paymentOptionId
It first gets the minimum report date (that is, the first entry) for every user and every type (so there are two records for a user who has 2 types) and then counts them, grouping by type (aka paymentOptionId).
By the way, you can of course trim the columns selected in the subquery in the FROM clause; they are only there so that you can copy and paste it and see the results it gives step by step.
You seem to want to report on various payment options and their counts for the earliest ReportDate for each user.
If so, here is an alternative approach
select p.paymentOptionId, count(*)
from payments p
where paymentOptionId in (46,47,48,49,50,51,52,53,54,55,56) and
not exists (select 1
from payments p2
where p2.userId = p.userId and
p2.ReportDate < p.ReportDate
)
group by paymentOptionId;
This isn't exactly the same as your query, because this will only report on the list of payment types, whereas you might want the first payment type for anyone who has ever had one of these types. I'm not sure which you want, though.
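If the other reading is the intended one (the first payment of every user who has ever used one of the listed options), the original query can be kept close to its shape by comparing a (userId, reportDate) pair instead of id, since MySQL accepts a row constructor on the left of IN. A sketch under that assumption; the aliases p, p1 and p2 are added for readability:

```sql
SELECT COUNT(*) AS payments, p.paymentOptionId
FROM payments AS p
WHERE (p.userId, p.reportDate) IN (
        -- earliest report date per user, for users who ever used a listed option
        SELECT p1.userId, MIN(p1.reportDate)
        FROM payments AS p1
        WHERE p1.userId IN (SELECT p2.userId
                            FROM payments AS p2
                            WHERE p2.paymentOptionId IN (46,47,48,49,50,51,52,53,54,55,56))
        GROUP BY p1.userId)
GROUP BY p.paymentOptionId;
```

If a user has two payments with exactly the same earliest reportDate, both rows match and both are counted.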