Using GROUP BY with MAX()/MIN() giving bad results - mysql

The table
The query
SELECT
id, MAX(fecha_hora_carga) AS fecha_hora_carga
FROM
calibraciones_instrumentos
GROUP BY
instrumento_id
The result
It's returning the most recent fecha_hora_carga dates, but the ids are 24 and 28... I think they should be 27 and 29!
Why don't the ids correspond to the dates?

The problem is that MySQL does not do what you might expect when you select a non-aggregated column alongside MAX() in a grouped query.
It computes the maximum for each group, but the other selected columns are taken from an arbitrary row of that group, not from the row that holds the maximum.
To get the matching row, use a subquery that finds the maximum per group and join it back to the table.
Here is an example:
SELECT
t1.id,
t1.fecha_hora_carga
FROM
calibraciones_instrumentos AS t1
JOIN(
SELECT MAX(fecha_hora_carga) AS fecha_hora_carga,
instrumento_id
FROM
calibraciones_instrumentos
GROUP BY
instrumento_id
) AS t2
ON (t1.fecha_hora_carga = t2.fecha_hora_carga AND
t1.instrumento_id = t2.instrumento_id
);

Because you are misusing SQL. You have one column in the GROUP BY clause, and that column isn't even being selected!
In most databases, including recent versions of MySQL (where ONLY_FULL_GROUP_BY is enabled by default), your query would raise an error because id is neither in the GROUP BY nor an argument to an aggregation function such as MIN().
So MySQL is returning an arbitrary id from each group. I would expect an aggregation query to look like this:
SELECT instrumento_id, MAX(fecha_hora_carga) AS fecha_hora_carga
FROM calibraciones_instrumentos
GROUP BY instrumento_id;
Or, if you want the row with the maximum fecha_hora_carga for each instrumento_id, use filtering:
select ci.*
from calibraciones_instrumentos ci
where ci.fecha_hora_carga = (select max(ci2.fecha_hora_carga)
from calibraciones_instrumentos ci2
where ci2.instrumento_id = ci.instrumento_id
);

This is because your query is incorrect.
MAX() is an aggregate function: it returns the maximum value of fecha_hora_carga within each group. It does not return the corresponding id, because it only produces a value, not a row.
See the following sample:
mysql> CREATE TABLE test_group_by (id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, val1 INT, val2 INT);
mysql> INSERT INTO test_group_by (val1, val2) VALUES (10, 1), (6, 1), (18, 1), (22, 2), (4, 2);
mysql> SELECT * FROM test_group_by;
+----+------+------+
| id | val1 | val2 |
+----+------+------+
| 1 | 10 | 1 |
| 2 | 6 | 1 |
| 3 | 18 | 1 |
| 4 | 22 | 2 |
| 5 | 4 | 2 |
+----+------+------+
mysql> SELECT id, MAX(val1) FROM test_group_by GROUP BY val2;
+----+-----------+
| id | MAX(val1) |
+----+-----------+
| 1 | 18 |
| 4 | 22 |
+----+-----------+
The example above is a simplified version of your table.
MAX() does not retrieve a row; it only returns the largest value within each group. Because your query also asks for an id, MySQL picks one from an arbitrary row of the group, so you cannot rely on which id comes back.
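If the goal is the whole row that holds the maximum, one option on MySQL 8.0+ is a window function. The sketch below is not from the answers above: it runs the test_group_by sample through Python's sqlite3 module purely so that it is self-contained (that choice is an assumption, as is the alias rn), and the SQL itself should carry over to MySQL 8.0+ unchanged.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here (an assumption made to keep the
# sketch runnable); ROW_NUMBER() needs SQLite 3.25+ or MySQL 8.0+.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_group_by (id INTEGER PRIMARY KEY, val1 INT, val2 INT)")
conn.executemany(
    "INSERT INTO test_group_by (val1, val2) VALUES (?, ?)",
    [(10, 1), (6, 1), (18, 1), (22, 2), (4, 2)],
)

# Number the rows inside each val2 group from highest val1 to lowest
# (ties broken by id), then keep only the first row of every group.
top_rows = conn.execute(
    """
    SELECT id, val1, val2
    FROM (
        SELECT id, val1, val2,
               ROW_NUMBER() OVER (PARTITION BY val2 ORDER BY val1 DESC, id) AS rn
        FROM test_group_by
    ) AS ranked
    WHERE rn = 1
    ORDER BY val2
    """
).fetchall()

print(top_rows)  # [(3, 18, 1), (4, 22, 2)]
```

Here the id really belongs to the row with the largest val1 (id 3 for val2 = 1), unlike the id 1 returned by the GROUP BY query above.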

Related

Joining table to itself with multiple join criteria logic

I'm trying to understand the logic behind the syntax below. Based on the following question, table and syntax:
Write a query that'll identify returning active users. A returning active user is a user that has made a second purchase within 7 days of any other of their purchases. Output a list of user_ids of these returning active users.
Column + Data Type:
id: int | user_id: int | item: varchar |created_at: datetime | revenue: int
SELECT DISTINCT(a1.user_id)
FROM amazon_transactions a1
JOIN amazon_transactions a2 ON a1.user_id=a2.user_id
AND a1.id <> a2.id
AND a2.created_at::date-a1.created_at::date BETWEEN 0 AND 7
ORDER BY a1.user_id
Why does the table need to be joined to itself in this case?
How does 'AND a1.id <> a2.id' portion of syntax contribute to the join?
You are looking for users who have two records in that table whose dates are at most 7 days apart.
To accomplish this, you treat the table as if it were two separate (but identical) tables, because you have to match a row in the first copy with a row in the second.
Of course you don't want to match a row with itself, so
AND a1.id <> a2.id
accomplishes that.
The table needs to be joined with itself because you have just one table and you want to find returning users by comparing the transaction dates of the same user.
The AND a1.id <> a2.id part of the join condition excludes pairs where both sides are the same transaction, i.e. it prevents a transaction from being matched with itself.
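A small experiment can make the role of a1.id <> a2.id concrete. This is only a sketch under assumptions: the four rows are made up for illustration, SQLite (via Python's sqlite3) stands in for the real database, and julianday(date(...)) replaces the PostgreSQL-style ::date subtraction from the question. The helper returning_users is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE amazon_transactions (id INT, user_id INT, created_at TEXT)")
# Illustrative rows: user 1 buys twice within 3 days, user 2 a month apart.
conn.executemany(
    "INSERT INTO amazon_transactions VALUES (?, ?, ?)",
    [
        (1, 1, "2020-01-05 15:33:22"),
        (2, 2, "2020-01-05 16:33:22"),
        (3, 1, "2020-01-08 18:33:22"),
        (4, 2, "2020-02-05 15:33:22"),
    ],
)

QUERY = """
    SELECT DISTINCT a1.user_id
    FROM amazon_transactions a1
    JOIN amazon_transactions a2
      ON a1.user_id = a2.user_id
     {extra}
     AND julianday(date(a2.created_at)) - julianday(date(a1.created_at))
         BETWEEN 0 AND 7
    ORDER BY a1.user_id
"""


def returning_users(exclude_self_pairs):
    # Toggle the self-pair filter to see what it changes.
    extra = "AND a1.id <> a2.id" if exclude_self_pairs else ""
    rows = conn.execute(QUERY.format(extra=extra)).fetchall()
    return [user_id for (user_id,) in rows]


with_condition = returning_users(True)
without_condition = returning_users(False)
print(with_condition)     # [1]
print(without_condition)  # [1, 2]
```

Without the condition every transaction pairs with itself at a distance of 0 days, so every user with at least one purchase would be reported as returning.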
There are two scenarios I can think of, depending on the id column. Are the id values generated in chronological order? If so, to answer your first question, we can use a join but don't have to. Here is how to achieve your goal with a correlated subquery, using some sample data (MySQL syntax).
create table amazon_transactions(id int , user_id int , item varchar(20),created_at datetime , revenue int);
insert amazon_transactions (id,user_id,created_at) values
(1,1,'2020-01-05 15:33:22'),
(2,2,'2020-01-05 16:33:22'),
(3,1,'2020-01-08 18:33:22'),
(4,1,'2020-01-22 17:33:22'),
(5,2,'2020-02-05 15:33:22'),
(6,2,'2020-03-05 15:33:22');
select * from amazon_transactions;
-- sample set:
| id | user_id | item | created_at | revenue |
+------+---------+------+---------------------+---------+
| 1 | 1 | NULL | 2020-01-05 15:33:22 | NULL |
| 2 | 2 | NULL | 2020-01-05 16:33:22 | NULL |
| 3 | 1 | NULL | 2020-01-08 18:33:22 | NULL |
| 4 | 1 | NULL | 2020-01-22 17:33:22 | NULL |
| 5 | 2 | NULL | 2020-02-05 15:33:22 | NULL |
| 6 | 2 | NULL | 2020-03-05 15:33:22 | NULL |
-- Here is the answer using a correlated subquery:
select distinct user_id
from amazon_transactions t
where datediff(
(select created_at from amazon_transactions where user_id=t.user_id and id-t.id>=1 order by id limit 1),
created_at
)<=7
;
-- result:
| user_id |
+---------+
| 1 |
However, what if the id values are NOT assigned in transaction-time order? Then the ids do not help with our requirement. In that case a JOIN is more capable than a correlated subquery, and we need to order the transactions by time for each user to build the join condition. To answer your second question: the AND a1.id <> a2.id part prevents a transaction from being paired with itself. However, to my understanding it matches more pairs than necessary. We only care whether CONSECUTIVE transactions are within 7 days of each other, but AND a1.id <> a2.id compares every pair. For instance, we want to check the gap between transaction1 and transaction2, and between transaction2 and transaction3, NOT between transaction1 and transaction3.
Note: by using the user-variable row-number trick, we can produce a row number that is used to match consecutive transactions for each user, which avoids checking every pair of transactions.
select distinct t1.user_id
from
(select user_id,created_at,@row_id:=@row_id+1 as row_id
from amazon_transactions ,(select @row_id:=0) t
order by user_id,created_at)t1
join
(select user_id,created_at,@row_num:=@row_num+1 as row_num
from amazon_transactions ,(select @row_num:=0) t
order by user_id,created_at)t2
on t1.user_id=t2.user_id and t2.row_num-t1.row_id=1 and datediff(t2.created_at,t1.created_at)<=7
;
-- result
| user_id |
+---------+
| 1 |
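On servers with window functions (MySQL 8.0+), the consecutive-transaction idea can probably be written without user variables by using LAG(). The following is a sketch, not part of the answer above: it loads the six sample rows into SQLite through Python's sqlite3 so it can run on its own, and it uses julianday(date(...)) where MySQL would use DATEDIFF(). The alias prev_created_at is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE amazon_transactions (id INT, user_id INT, created_at TEXT)")
conn.executemany(
    "INSERT INTO amazon_transactions VALUES (?, ?, ?)",
    [
        (1, 1, "2020-01-05 15:33:22"),
        (2, 2, "2020-01-05 16:33:22"),
        (3, 1, "2020-01-08 18:33:22"),
        (4, 1, "2020-01-22 17:33:22"),
        (5, 2, "2020-02-05 15:33:22"),
        (6, 2, "2020-03-05 15:33:22"),
    ],
)

# LAG() fetches the previous purchase of the same user (ordered by time),
# so each row is compared only with its immediate predecessor.
rows = conn.execute(
    """
    SELECT DISTINCT user_id
    FROM (
        SELECT user_id,
               created_at,
               LAG(created_at) OVER (
                   PARTITION BY user_id ORDER BY created_at
               ) AS prev_created_at
        FROM amazon_transactions
    ) AS paired
    WHERE prev_created_at IS NOT NULL
      AND julianday(date(created_at)) - julianday(date(prev_created_at)) <= 7
    ORDER BY user_id
    """
).fetchall()

returning = [user_id for (user_id,) in rows]
print(returning)  # [1]
```

This does not depend on the ids being in chronological order, since only created_at drives the ordering.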

What is SQL to select a property and the max number of occurrences of a related property?

I have a table like this:
Table: p
+----+------+
| id | w_id |
+----+------+
|  5 |    8 |
|  5 |   10 |
|  5 |    8 |
|  5 |   10 |
|  5 |    8 |
|  6 |    5 |
|  6 |    8 |
|  6 |   10 |
|  6 |   10 |
|  7 |    8 |
|  7 |   10 |
+----+------+
What is the best SQL to get the following result? :
+----+----------------+
| id | most_used_w_id |
+----+----------------+
|  5 |              8 |
|  6 |             10 |
|  7 |              8 |
+----+----------------+
In other words, to get, per id, the most frequent related w_id.
Note that on the example above, id 7 is related to 8 once and to 10 once.
So, either (7, 8) or (7, 10) will do as result. If it is not possible to
pick up one, then both (7, 8) and (7, 10) on result set will be ok.
I have come up with something like:
select counters2.p_id as id, counters2.w_id as most_used_w_id
from (
select p.id as p_id,
w_id,
count(w_id) as count_of_w_ids
from p
group by id, w_id
) as counters2
join (
select p_id, max(count_of_w_ids) as max_counter_for_w_ids
from (
select p.id as p_id,
w_id,
count(w_id) as count_of_w_ids
from p
group by id, w_id
) as counters
group by p_id
) as p_max
on p_max.p_id = counters2.p_id
and p_max.max_counter_for_w_ids = counters2.count_of_w_ids
;
but I am not sure at all whether this is the best way to do it. And I had to repeat the same sub-query two times.
Any better solution?
Try using user-defined variables:
select id,w_id
FROM
( select T.*,
if(@id<>id,1,0) as row,
@id:=id FROM
(
select id,W_id, Count(*) as cnt FROM p Group by ID,W_id
) as T,(SELECT @id:=0) as T1
ORDER BY id,cnt DESC
) as T2
WHERE Row=1
Formal SQL
In fact, your solution is correct in terms of standard SQL. Why? Because you have to join the values from the original data to the grouped data, so your query cannot be simplified much. MySQL allows mixing non-grouped columns with aggregate functions, but the result is unreliable, so I would not recommend relying on that behavior.
MySQL
Since you're using MySQL, you can use variables. I'm not a big fan of them, but for your case they may be used to simplify things:
SELECT
c.*,
IF(@id!=id, @i:=1, @i:=@i+1) AS num,
@id:=id AS gid
FROM
(SELECT id, w_id, COUNT(w_id) AS w_count
FROM p
GROUP BY id, w_id
ORDER BY id DESC, w_count DESC) AS c
CROSS JOIN (SELECT @i:=-1, @id:=-1) AS init
HAVING
num=1;
So for your data result will look like:
+------+------+---------+------+------+
| id | w_id | w_count | num | gid |
+------+------+---------+------+------+
| 7 | 8 | 1 | 1 | 7 |
| 6 | 10 | 2 | 1 | 6 |
| 5 | 8 | 3 | 1 | 5 |
+------+------+---------+------+------+
Thus, you've found your id and the corresponding w_id. The idea is to count the rows and number them within each id, relying on the fact that the subquery orders them. So we need only the first row of each id (because it holds the highest count).
This could be replaced with a single GROUP BY id, but, again, the server is free to choose any row in that case (it tends to work because it takes the first row, but the documentation does not guarantee that in general).
One nice thing about this approach is that you can also select, for example, the 2nd or 3rd most frequent value; it's very flexible.
Performance
To improve performance, you can create an index on (id, w_id), which can be used for ordering and grouping the records. The variables and HAVING, however, will cause a row-by-row scan of the set produced by the inner GROUP BY. That isn't as bad as a full scan of the original data, but it is still a drawback of doing this with variables. On the other hand, doing it with a JOIN and subquery as in your query won't be much different, because a temporary table is created for the subquery result set as well.
But to be certain, you'll have to test. And keep in mind that you already have a valid solution which, by the way, isn't tied to DBMS-specific features and is good standard SQL.
Try this query
select p_id, ccc , w_id from
(
select p.id as p_id,
w_id, count(w_id) ccc
from p
group by id,w_id order by id,ccc desc) xxx
group by p_id having max(ccc)
You can also use this code if you do not want to rely on the first record of non-grouping columns
select p_id, ccc , w_id from
(
select p.id as p_id,
w_id, count(w_id) ccc
from p
group by id,w_id order by id,ccc desc) xxx
group by p_id having ccc=max(ccc);
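Regarding the repeated sub-query in the question: on MySQL 8.0+ a common table expression (WITH) lets the counting step be written once and referenced twice. The sketch below is an assumption-laden illustration rather than anyone's posted answer: it loads the rows of table p into SQLite via Python's sqlite3 so it runs standalone, and the names counters, cnt and max_cnt are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE p (id INT, w_id INT)")
conn.executemany(
    "INSERT INTO p VALUES (?, ?)",
    [(5, 8), (5, 10), (5, 8), (5, 10), (5, 8),
     (6, 5), (6, 8), (6, 10), (6, 10),
     (7, 8), (7, 10)],
)

# The CTE computes the per-(id, w_id) counts once; the outer query joins
# those counts to their own per-id maximum. Ties are all returned.
most_used = conn.execute(
    """
    WITH counters AS (
        SELECT id, w_id, COUNT(*) AS cnt
        FROM p
        GROUP BY id, w_id
    )
    SELECT c.id, c.w_id AS most_used_w_id
    FROM counters AS c
    JOIN (
        SELECT id, MAX(cnt) AS max_cnt
        FROM counters
        GROUP BY id
    ) AS m
      ON m.id = c.id AND m.max_cnt = c.cnt
    ORDER BY c.id, c.w_id
    """
).fetchall()

print(most_used)  # [(5, 8), (6, 10), (7, 8), (7, 10)]
```

Both (7, 8) and (7, 10) appear because they tie, which the question says is acceptable. Unlike the GROUP BY ... HAVING variants, this does not rely on which row the server happens to pick for a non-grouped column.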

MySQL conditionally populate column 3 based on DISTINCT involving 2 other columns in one table

I had a good read through similar topics, but I can't quite a) find one that matches my scenario, or b) understand the others well enough to tailor or tweak them to my situation.
I have a table, the important fields being;
+------+------+--------+--------+
| ID | Name | Price |Status |
+------+------+--------+--------+
| 1 | Fred | 4.50 | |
| 2 | Fred | 4.50 | |
| 3 | Fred | 5.00 | |
| 4 | John | 7.20 | |
| 5 | John | 7.20 | |
| 6 | John | 7.20 | |
| 7 | Max | 2.38 | |
| 8 | Max | 2.38 | |
| 9 | Sam | 21.00 | |
+------+------+--------+--------+
ID is an auto-incrementing value as records get added throughout the day.
NAME is a Primary Key field, which can repeat 1 to 3 times in the whole table.
Each NAME will have a PRICE value, which may or may not be the same per NAME.
There is also a STATUS field that needs to be populated based on the following rules, which is the part I am stuck on.
Status = 'Y' if a given name has only one distinct price attached to it.
Status = 'N' if a given name has multiple distinct prices attached to it.
Using the table above, IDs 1, 2 and 3 should be 'N', whilst 4, 5, 6, 7, 8 and 9 should be 'Y'.
I think this may well involve some form of combination of JOINs, GROUPs, and DISTINCTs but I am at a loss on how to put that into the right order for SQL.
In order to get the count of distinct Price values per name, we must use a GROUP BY on the Name field, but since you also want to display all names ungrouped but with an additional Status field, we must first create a subselect in the FROM clause which groups by the name and determines whether the name has multiple price values or not.
When we GROUP BY Name in the subselect, COUNT(DISTINCT price) will count the number of distinct price values for each particular name. Without the DISTINCT keyword, it would simply count the number of rows where price is not null.
In conjunction with that, we use a CASE expression to insert N into the Status column if there is more than one distinct Price value for the particular name, otherwise, it will insert Y.
The subselect only returns one row per Name, so to get all names ungrouped, we join that subselect to the main table on the condition that the subselect's Name = the main table's Name:
SELECT
b.ID,
b.Name,
b.Price,
a.Status
FROM
(
SELECT Name, CASE WHEN COUNT(DISTINCT Price) > 1 THEN 'N' ELSE 'Y' END AS Status
FROM tbl
GROUP BY Name
) a
INNER JOIN
tbl b ON a.Name = b.Name
Edit: In order to facilitate an update, you can incorporate this query using JOINs in the UPDATE like so:
UPDATE
tbl a
INNER JOIN
(
SELECT Name, CASE WHEN COUNT(DISTINCT Price) > 1 THEN 'N' ELSE 'Y' END AS Status
FROM tbl
GROUP BY Name
) b ON a.Name = b.Name
SET
a.Status = b.Status
Assuming you have an unfilled Status column in your table.
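As a sanity check of the grouped-subquery-plus-join idea, the SELECT can be run against the sample rows from the question. This sketch makes a few assumptions: SQLite (through Python's sqlite3) stands in for MySQL, prices are stored as REAL, and the variable status_by_id is just a convenience for inspecting the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (ID INTEGER PRIMARY KEY, Name TEXT, Price REAL, Status TEXT)"
)
conn.executemany(
    "INSERT INTO tbl (ID, Name, Price) VALUES (?, ?, ?)",
    [
        (1, "Fred", 4.50), (2, "Fred", 4.50), (3, "Fred", 5.00),
        (4, "John", 7.20), (5, "John", 7.20), (6, "John", 7.20),
        (7, "Max", 2.38), (8, "Max", 2.38),
        (9, "Sam", 21.00),
    ],
)

# One status per Name from the grouped subquery, joined back to every row.
rows = conn.execute(
    """
    SELECT b.ID, a.Status
    FROM (
        SELECT Name,
               CASE WHEN COUNT(DISTINCT Price) > 1 THEN 'N' ELSE 'Y' END AS Status
        FROM tbl
        GROUP BY Name
    ) AS a
    INNER JOIN tbl AS b ON a.Name = b.Name
    ORDER BY b.ID
    """
).fetchall()

status_by_id = dict(rows)
print(status_by_id)  # IDs 1-3 map to 'N', IDs 4-9 map to 'Y'
```

Only Fred has two distinct prices (4.50 and 5.00), so only his three rows get 'N'.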
If you want to update the status column, you could do:
UPDATE mytable s
SET status = (
SELECT IF(COUNT(DISTINCT price)=1, 'Y', 'N') c
FROM (
SELECT *
FROM mytable
) s1
WHERE s1.name = s.name
GROUP BY name
);
Technically, it should not be necessary to have this:
FROM (
SELECT *
FROM mytable
) s1
but MySQL has a limitation that prevents you from selecting, in a subquery, from the table you are updating. By wrapping it in parentheses as a derived table, we force MySQL to create a temporary table, and then it becomes possible.

Using SQL to get distinct rows, but also the whole row for those

OK, so it's easier to give an example, and hopefully someone has a solution:
I have a table that holds bids:
ID | companyID | userID | contractID | bidAmount | dateAdded
Below is an example set of rows that could be in the table:
ID | companyID | userID | contractID | bidAmount | dateAdded
--------------------------------------------------------------
10 | 2 | 1 | 94 | 1.50 | 1309933407
9 | 2 | 1 | 95 | 1.99 | 1309933397
8 | 2 | 1 | 96 | 1.99 | 1309933394
11 | 103 | 1210 | 96 | 1.98 | 1309947237
12 | 2 | 1 | 96 | 1.97 | 1309947252
What I would like to do is get all the info (as with * in a normal select statement) for the lowest bid for each unique contractID.
So I would need the following rows:
ID = 10 (for contractID = 94)
ID = 9 (for contractID = 95)
ID = 12 (for contractID = 96)
I want to ignore all the others. I thought about using DISTINCT, but I haven't been able to get it to return all the columns, only the column I'm using for the distinct.
Does anyone have any suggestions?
Thanks,
Jeff
select *
from mytable main
where bidAmount = (
select min(bidAmount)
from mytable
where contractID = main.contractID)
Note that this will return multiple rows if there is more than one record sharing the same minimum bid.
I didn't test it, but it should be possible with this query, although it might not be very fast:
SELECT * FROM bids WHERE ID IN (
SELECT ID FROM bids GROUP BY contractID ORDER BY MIN(bidAmount) ASC
)
This would be the query for MySQL, maybe you need to adjust it for another db.
You could use a subquery to find the lowest rowid per contractid:
select *
from YourTable
where id in
(
select min(id)
from YourTable
group by
ContractID
)
The problem is that DISTINCT does not return a specific row; it returns distinct values, which (by definition) could occur on multiple rows.
Subqueries are your answer, and one of the suggestions above probably does what you need. Your subquery needs to return the ids of the rows with the minimum bid amount. Then you can select * from the rows with those ids.
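Another way to express "the row with the lowest bid per contract", not mentioned in the answers above, is an anti-join: keep a bid only if no lower bid exists for the same contract. The sketch below assumes the table is called bids and uses SQLite via Python's sqlite3 just to be runnable; the alias lower_bid is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bids (ID INT, companyID INT, userID INT, "
    "contractID INT, bidAmount REAL, dateAdded INT)"
)
conn.executemany(
    "INSERT INTO bids VALUES (?, ?, ?, ?, ?, ?)",
    [
        (10, 2, 1, 94, 1.50, 1309933407),
        (9, 2, 1, 95, 1.99, 1309933397),
        (8, 2, 1, 96, 1.99, 1309933394),
        (11, 103, 1210, 96, 1.98, 1309947237),
        (12, 2, 1, 96, 1.97, 1309947252),
    ],
)

# LEFT JOIN each bid to any strictly lower bid on the same contract;
# rows with no such match (lower_bid.ID IS NULL) are the minimum bids.
lowest = conn.execute(
    """
    SELECT b.ID, b.contractID, b.bidAmount
    FROM bids AS b
    LEFT JOIN bids AS lower_bid
      ON lower_bid.contractID = b.contractID
     AND lower_bid.bidAmount < b.bidAmount
    WHERE lower_bid.ID IS NULL
    ORDER BY b.contractID
    """
).fetchall()

print(lowest)  # [(10, 94, 1.5), (9, 95, 1.99), (12, 96, 1.97)]
```

Like the correlated-subquery answer, this returns several rows for a contract if two bids tie for the minimum.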

MySQL: How to GROUP BY a field to retrieve the rows with ORDER BY another field?

Assume the following data:
Data:
id | date | name | grade
--------+---------------+-----------+---------------
1 | 2010/12/03 | Mike | 12
2 | 2010/12/04 | Jenny | 12
3 | 2010/12/04 | Ronald | 15
4 | 2010/12/03 | Yeni | 11
I want to know who has the best grade on each day, something like this:
Desired Result:
id | date | name | grade
--------+---------------+-----------+---------------
1 | 2010/12/03 | Mike | 12
3 | 2010/12/04 | Ronald | 15
I thought the query should look like this:
SELECT name FROM mytable
GROUP BY date
ORDER BY grade DESC
but it returns something like this:
Current Unwanted Result:
id | date | name | grade
--------+---------------+-----------+---------------
1 | 2010/12/03 | Mike | 12
2 | 2010/12/04 | Jenny | 12
I searched and found the reason:
GROUP BY happens before ORDER BY, so the grouping does not see the ordering and can't apply it.
So how can I apply an ORDER to GROUP BY?
Note: please keep in mind that I need the simplest possible query, because my actual query is very complex. I know I can achieve this result with a subquery or a JOIN, but I want to know how to apply ORDER to GROUP BY. Thanks.
I used Oracle for this example, but the SQL should work in mysql (you may need to tweak the to_date stuff to work with mysql). You really need a subquery here to do what you are asking.
CREATE TABLE mytable (ID NUMBER, dt DATE, NAME VARCHAR2(25), grade NUMBER);
INSERT INTO mytable VALUES(1,to_date('2010-12-03','YYYY-MM-DD'),'Mike',12);
INSERT INTO mytable VALUES(2,to_date('2010-12-04','YYYY-MM-DD'),'Jenny',12);
INSERT INTO mytable VALUES(3,to_date('2010-12-04','YYYY-MM-DD'),'Ronald',15);
INSERT INTO mytable VALUES(4,to_date('2010-12-03','YYYY-MM-DD'),'Yeni',11);
SELECT id
, dt
, name
, grade
FROM mytable t1
WHERE grade = (SELECT max(grade)
FROM mytable t2
WHERE t1.dt = t2.dt)
ORDER BY dt
Results:
ID DT NAME GRADE
1 12/3/2010 Mike 12
3 12/4/2010 Ronald 15
I know you said you wanted a GROUP / ORDER only solution but you will need to use a subquery in this instance. The simplest way would be something like this:
SELECT id, date, name, grade
FROM mytable t1
WHERE grade =
(SELECT MAX(t2.grade) FROM mytable t2 WHERE t1.date = t2.date)
This would show multiple students if they shared the highest grade for the day.
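The column used to correlate the subquery matters a lot here. As a rough check, the sketch below runs the question's four rows through SQLite (via Python's sqlite3, an assumption made only so the example is runnable) and compares correlating on date with correlating on id; the helper best_per_group is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INT, date TEXT, name TEXT, grade INT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?)",
    [
        (1, "2010-12-03", "Mike", 12),
        (2, "2010-12-04", "Jenny", 12),
        (3, "2010-12-04", "Ronald", 15),
        (4, "2010-12-03", "Yeni", 11),
    ],
)


def best_per_group(correlate_on):
    # The column name comes from a fixed whitelist, never from user input.
    assert correlate_on in ("date", "id")
    sql = f"""
        SELECT id, name, grade
        FROM mytable AS t1
        WHERE grade = (SELECT MAX(t2.grade)
                       FROM mytable AS t2
                       WHERE t2.{correlate_on} = t1.{correlate_on})
        ORDER BY id
    """
    return conn.execute(sql).fetchall()


by_date = best_per_group("date")
by_id = best_per_group("id")
print(by_date)     # [(1, 'Mike', 12), (3, 'Ronald', 15)]
print(len(by_id))  # 4
```

Correlating on date yields the best grade of each day. Correlating on id compares every row only with itself, so all four rows come back.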