Query to Segment Results Based on Equal Sets of Column Value - mysql

I'd like to construct a single query (or as few as possible) to group a data set. So given a number of buckets, I'd like to return results based on a specific column.
So given a column called score which is a double which contains:
90.00
91.00
94.00
96.00
98.00
99.00
I'd like to be able to use a GROUP BY clause with a function like:
SELECT MIN(score), MAX(score), SUM(score) FROM table GROUP BY BUCKETS(score, 3)
Ideally this would return 3 rows (grouping the results into 3 buckets with as close to equal count in each group as is possible):
90.00, 91.00, 181.00
94.00, 96.00, 190.00
98.00, 99.00, 197.00
Is there some function that would do this? I'd like to avoid returning all the rows and figuring out the bucket segments myself.
Dave

create table test (
id int not null auto_increment primary key,
val decimal(4,2)
) engine = myisam;
insert into test (val) values
(90.00),
(91.00),
(94.00),
(96.00),
(98.00),
(99.00);
select min(val) as lower, max(val) as higher, sum(val) as total from (
select id, val, @row := @row + 1 as rn -- alias rn, because ROW is a reserved word in MySQL 8.0
from test, (select @row := 0) as r order by id
) as t
group by ceil(rn/2)
+-------+--------+--------+
| lower | higher | total |
+-------+--------+--------+
| 90.00 | 91.00 | 181.00 |
| 94.00 | 96.00 | 190.00 |
| 98.00 | 99.00 | 197.00 |
+-------+--------+--------+
3 rows in set (0.00 sec)
Unfortunately, MySQL before 8.0 has no analytic function like ROW_NUMBER(), so you have to emulate it with a user variable. Once you have a row number, you can simply use the ceil() function to group every n rows as you like. Hope it helps.
For a given number of buckets (3 here), compute the group size from the row count:
set @r = (select count(*) from test);
select min(val) as lower, max(val) as higher, sum(val) as total from (
select id, val, @row := @row + 1 as rn
from test, (select @row := 0) as r order by id
) as t
group by ceil(rn/ceil(@r/3))
Or, with a single query:
select min(val) as lower, max(val) as higher, sum(val) as total from (
select id, val, @row := @row + 1 as rn, tot
from test, (select count(*) as tot from test) as t2, (select @row := 0) as r order by id
) as t
group by ceil(rn/ceil(tot/3))
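On MySQL 8.0 or later, window functions make the variable trick unnecessary. The following is a minimal sketch, assuming MySQL 8.0+ and the test table defined above. NTILE(3) assigns each row to one of 3 buckets in val order; the alias bucket is an illustrative name.

```sql
-- Sketch only: requires MySQL 8.0+ (window functions).
SELECT MIN(val) AS lower, MAX(val) AS higher, SUM(val) AS total
FROM (
    -- NTILE(3) numbers the buckets 1..3, keeping their sizes as equal as possible
    SELECT val, NTILE(3) OVER (ORDER BY val) AS bucket
    FROM test
) AS t
GROUP BY bucket
ORDER BY bucket;
```

Unlike the ceil() grouping above, NTILE always produces the requested number of buckets when there are at least that many rows, and it places any remainder rows in the first buckets.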

Related

How to fetch rows from which sum of a single integer/float column sums upto a certain value

I have a table. It has the following structure
goods_receiving_items
id
item_id
quantity
created_at
I am trying to fetch rows that meet the following conditions:
They all have the same item_id
The sum of their quantity column equals a certain value
So for example I have the following data
+----+---------+----------+------------+
| id | item_id | quantity | created_at |
+----+---------+----------+------------+
| 1 | 2 | 11 | 2019-10-10 |
| 2 | 3 | 110 | 2019-10-11 |
| 3 | 2 | 20 | 2019-11-09 |
| 4 | 2 | 5 | 2019-11-10 |
| 5 | 2 | 1 | 2019-11-11 |
+----+---------+----------+------------+
I have tried the following query:
SET @sum := 0;
SELECT item_id, created_at, (@sum := @sum + quantity) AS SUM, quantity
FROM goods_receiving_items
WHERE item_id = 2 AND @sum <= 6
ORDER BY created_at DESC
If I don't use ORDER BY, then the query will give me ID '1'. But if I use ORDER BY it will return all the rows with item_id = 2.
What should be returned are IDs '5' and '4' exclusively in this order
I can't seem to resolve this and ORDER BY is essential to my task.
Any help would be appreciated
You should apply the ORDER BY to the resulting set.
You could do this using a subquery:
SET @sum := 0;
select t.*
from (
SELECT item_id
, created_at
, (@sum := @sum + quantity) as sum
, quantity
FROM goods_receiving_items
WHERE item_id = 2 AND @sum <= 6
) t
ORDER BY created_at DESC
You should try an INNER JOIN with SELECT min(created_at) or SELECT max(created_at)
From MYSQL docs:
...the selection of values from each group cannot be influenced by
adding an ORDER BY clause. Sorting of the result set occurs after
values have been chosen, and ORDER BY does not affect which values the
server chooses.
The answers on the following might help in more detail: MYSQL GROUP BY and ORDER BY not working together as expected
After searching around, I came up with the following query:
SELECT
t.id, t.quantity, t.created_at, t.sum
FROM
( SELECT
*,
@bal := @bal + quantity AS sum,
IF(@bal >= $search_number, @doneHere := @doneHere + 1 , @doneHere) AS whereToStop
FROM goods_receiving_items
CROSS JOIN (SELECT @bal := 0.0 , @doneHere := 0) var
WHERE item_id = $item_id
ORDER BY created_at DESC) AS t
WHERE t.whereToStop <= 1
ORDER BY t.created_at ASC
In the above query, $search_number is a variable that holds the value that has to be reached. $item_id is the item we are searching against.
This will return all rows for which the sum of the column quantity makes up the required sum. The sum will be made with rows in descending order by created_at and then will be rearranged in ascending order.
I was using this query to calculate the cost when a certain amount of items are being used in an inventory management system; so this might help someone else do the same. I took most of the query from another question here on StackOverflow
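On MySQL 8.0 or later, the same running total can be sketched with a window function instead of user variables. This is an illustration, not the query used above: it assumes MySQL 8.0+ and positive quantities, and it hard-codes item_id = 2 and a target of 6 in place of $item_id and $search_number. The alias running_total is an illustrative name.

```sql
-- Sketch only: requires MySQL 8.0+ (window functions).
SELECT id, quantity, created_at, running_total
FROM (
    SELECT id, quantity, created_at,
           -- cumulative quantity, newest rows first; id breaks ties on created_at
           SUM(quantity) OVER (ORDER BY created_at DESC, id DESC) AS running_total
    FROM goods_receiving_items
    WHERE item_id = 2
) AS t
-- keep a row only while the total BEFORE that row is still below the target,
-- so the row that reaches the target is included
WHERE running_total - quantity < 6
ORDER BY created_at DESC;
```

With the sample data from the question, this keeps the rows with ids 5 and 4, in that order.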

Select all records where last n characters in column are not unique

I have a slightly strange requirement in MySQL.
I need to select all records from a table where the last 6 characters are not unique.
For example, if I have this table:
I should select rows 1 and 3, since the last 6 letters of their values are not unique.
Do you have any idea how to implement this?
Thank you for help.
I use a JOIN against a subquery that counts the occurrences of each unique combination of the last n characters (n = 2 in my example):
SELECT t.*
FROM t
JOIN (SELECT RIGHT(value, 2) r, COUNT(RIGHT(value, 2)) rc
FROM t
GROUP BY r) c ON c.r = RIGHT(value, 2) AND c.rc > 1
Something like that should work:
SELECT `mytable`.*
FROM (SELECT RIGHT(`value`, 6) AS `ending` FROM `mytable` GROUP BY `ending` HAVING COUNT(*) > 1) `grouped`
INNER JOIN `mytable` ON `grouped`.`ending` = RIGHT(`value`, 6)
but it is not fast. This requires a full table scan. Maybe you should rethink your problem.
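One possible way to reduce the cost of computing RIGHT() for every row on each query is to store the ending in an indexed generated column. This is a sketch under stated assumptions, not a measured improvement: it assumes MySQL 5.7 or later, and the column name ending and index name idx_ending are illustrative.

```sql
-- Sketch only: requires MySQL 5.7+ (generated columns).
-- Materialise the last 6 characters once per row and index them.
ALTER TABLE `mytable`
    ADD COLUMN `ending` VARCHAR(6) GENERATED ALWAYS AS (RIGHT(`value`, 6)) STORED,
    ADD INDEX `idx_ending` (`ending`);

-- Same logic as the query above, but grouping and joining on the indexed column.
SELECT m.*
FROM `mytable` AS m
JOIN (
    SELECT `ending`
    FROM `mytable`
    GROUP BY `ending`
    HAVING COUNT(*) > 1
) AS dup ON dup.`ending` = m.`ending`;
```

Whether this is actually faster depends on the data and the server version, so it is worth checking with EXPLAIN before relying on it.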
EDITED: I had a wrong understanding of the question previously and I don't really want to change anything from my initial answer. But if my previous answer is not acceptable in some environment and it might mislead people, I have to correct it anyhow.
SELECT GROUP_CONCAT(id),RIGHT(VALUE,6)
FROM table1
GROUP BY RIGHT(VALUE,6) HAVING COUNT(RIGHT(VALUE,6)) > 1;
Since this question already has good answers, I wrote my query in a slightly different way. And I've tested it with sql_mode=ONLY_FULL_GROUP_BY. ;)
This is what you need: a subquery to get the duplicated right(value,6) values, and a main query to get the rows matching that condition.
SELECT t.* FROM t WHERE RIGHT(`value`,6) IN (
SELECT RIGHT(`value`,6)
FROM t
GROUP BY RIGHT(`value`,6) HAVING COUNT(*) > 1);
UPDATE
This version avoids the MySQL error you get when sql_mode=only_full_group_by is set:
SELECT t.* FROM t WHERE RIGHT(`value`,6) IN (
SELECT DISTINCT right_value FROM (
SELECT RIGHT(`value`,6) AS right_value,
COUNT(*) AS TOT
FROM t
GROUP BY RIGHT(`value`,6) HAVING COUNT(*) > 1) t2
)
This might be fast, as there is no counting involved.
Live test: https://www.db-fiddle.com/f/dBdH9tZd4W6Eac1TCRXZ8U/0
select *
from tbl outr
where not exists
(
select 1 / 0 -- just a proof that this is not evaluated. won't cause division by zero
from tbl inr
where
inr.id <> outr.id
and right(inr.value, 6) = right(outr.value, 6)
)
Output:
| id | value |
| --- | --------------- |
| 2 | aaaaaaaaaaaaaa |
| 4 | aaaaaaaaaaaaaaB |
| 5 | Hello |
The logic is to test the other rows, i.e. those whose id differs from the outer row's id. If any of those rows has the same last 6 characters as the outer row, then don't show that outer row.
UPDATE
I misunderstood the OP's intent. It's the reverse. Anyway, just reverse the logic: use EXISTS instead of NOT EXISTS.
Live test: https://www.db-fiddle.com/f/dBdH9tZd4W6Eac1TCRXZ8U/3
select *
from tbl outr
where exists
(
select 1 / 0 -- just a proof that this is not evaluated. won't cause division by zero
from tbl inr
where
inr.id <> outr.id
and right(inr.value, 6) = right(outr.value, 6)
)
Output:
| id | value |
| --- | ----------- |
| 1 | abcdePuzzle |
| 3 | abcPuzzle |
UPDATE
Tested the query. The performance of my answer (correlated EXISTS approach) is not optimal. Just keeping my answer, so others will know what approach to avoid :)
GhostGambler's answer is faster than correlated EXISTS approach. For 5 million rows, his answer takes 2.762 seconds only:
explain analyze
SELECT
tbl.*
FROM
(
SELECT
RIGHT(value, 6) AS ending
FROM
tbl
GROUP BY
ending
HAVING
COUNT(*) > 1
) grouped
JOIN tbl ON grouped.ending = RIGHT(value, 6)
My answer (correlated EXISTS) takes 4.08 seconds:
explain analyze
select *
from tbl outr
where exists
(
select 1 / 0 -- just a proof that this is not evaluated. won't cause division by zero
from tbl inr
where
inr.id <> outr.id
and right(inr.value, 6) = right(outr.value, 6)
)
The straightforward query is the fastest: no join, just a plain IN query, at 2.722 seconds. It has practically the same performance as the JOIN approach, since they have the same execution plan. This is kiks73's answer. I just don't know why he made his second answer unnecessarily complicated.
So it's just a matter of taste, or of choosing which code is more readable: select from in vs select from join.
explain analyze
SELECT *
FROM tbl
where right(value, 6) in
(
SELECT
RIGHT(value, 6) AS ending
FROM
tbl
GROUP BY
ending
HAVING
COUNT(*) > 1
)
Test data used:
CREATE TABLE tbl (
id INTEGER primary key,
value VARCHAR(20)
);
INSERT INTO tbl
(id, value)
VALUES
('1', 'abcdePuzzle'),
('2', 'aaaaaaaaaaaaaa'),
('3', 'abcPuzzle'),
('4', 'aaaaaaaaaaaaaaB'),
('5', 'Hello');
insert into tbl(id, value)
select x.y, 'Puzzle'
from generate_series(6, 5000000) as x(y);
create index ix_tbl__right on tbl(right(value, 6));
Performances without the index, and with index on tbl(right(value, 6)):
JOIN approach:
Without index: 3.805 seconds
With index: 2.762 seconds
IN approach:
Without index: 3.719 seconds
With index: 2.722 seconds
Just a bit neater code (if using MySQL 8.0). Can't guarantee the performance though
Live test: https://www.db-fiddle.com/f/dBdH9tZd4W6Eac1TCRXZ8U/1
select x.*
from
(
select
*,
count(*) over(partition by right(value, 6)) as unique_count
from tbl
) as x
where x.unique_count = 1
Output:
| id | value | unique_count |
| --- | --------------- | ------------ |
| 2 | aaaaaaaaaaaaaa | 1 |
| 4 | aaaaaaaaaaaaaaB | 1 |
| 5 | Hello | 1 |
UPDATE
I misunderstood the OP's intent. It's the reverse. Just change the count condition:
select x.*
from
(
select
*,
count(*) over(partition by right(value, 6)) as unique_count
from tbl
) as x
where x.unique_count > 1
Output:
| id | value | unique_count |
| --- | ----------- | ------------ |
| 1 | abcdePuzzle | 2 |
| 3 | abcPuzzle | 2 |

MySQL select - If a column value is redundant, only show the newest by timestamp

I have a table like this:
timesent |nr | value
2018-10-31 05:23:06 | 4 | Value 3
2018-10-31 05:20:19 | 4 | Value 2
2018-10-31 05:19:35 | 4 | Value 1
2018-10-31 04:55:56 | 3 | Value 2
2018-10-31 03:05:15 | 3 | Value 1
2018-10-31 01:31:49 | 2 | Value 1
2018-10-30 04:11:16 | 1 | Value 1
At the moment, my select looks like this:
SELECT * FROM values ORDER BY timesent DESC
I want to do an sql-select statement which gives me back only the most recent value of each "nr".
My skills are not good enough to translate that into a sql-statement. I don't even know what I should google for.
Values is a Reserved Keyword in MySQL. Consider changing your table name to something else; otherwise you will have to use backticks around it
There are various ways to achieve the result for your problem. One way is to do a "Self-Left-Join" on nr (field on which you want to get the maximum timesent value row only).
SELECT v1.*
FROM `values` AS v1
LEFT JOIN `values` AS v2
ON v1.nr = v2.nr AND
v1.timesent < v2.timesent
WHERE v2.nr IS NULL
For MySQL version >= 8.0.2, you can use Window Functions. We will determine Row_Number() for each row over a partition of nr, with timesent in Descending order (Highest timesent value will have row number = 1). Then, use this result-set in a Derived Table and consider only those rows, where row number is equal to 1.
SELECT dt.timesent,
dt.nr,
dt.value
FROM
(
SELECT v.timesent, v.nr, v.value,
ROW_NUMBER() OVER (PARTITION BY v.nr
ORDER BY v.timesent DESC) AS row_num
FROM `values` AS v
) AS dt
WHERE dt.row_num = 1
Yet another approach is to get the maximum value of timesent for each nr group in a Derived Table, then join this result-set to the main table, so that only the rows corresponding to the max value appear:
SELECT v.timesent,
v.nr,
v.value
FROM
`values` AS v
JOIN
(
SELECT nr, MAX(timesent) AS max_timesent
FROM `values`
GROUP BY nr
) AS dt ON dt.nr = v.nr AND
dt.max_timesent = v.timesent

Find two closest elements from one table to other element from another table

I have two tables:
DROP TABLE IF EXISTS `left_table`;
CREATE TABLE `left_table` (
`l_id` INT(11) NOT NULL AUTO_INCREMENT,
`l_curr_time` INT(11) NOT NULL,
PRIMARY KEY(l_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
DROP TABLE IF EXISTS `right_table`;
CREATE TABLE `right_table` (
`r_id` INT(11) NOT NULL AUTO_INCREMENT,
`r_curr_time` INT(11) NOT NULL,
PRIMARY KEY(r_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO left_table(l_curr_time) VALUES
(3),(4),(6),(10),(13);
INSERT INTO right_table(r_curr_time) VALUES
(1),(5),(7),(8),(11),(12);
I want to map (if exists) two closest r_curr_time from right_table to each l_curr_time from left_table such that r_curr_time must be greater or equal to l_curr_time.
The expected result for given values should be:
+------+-------------+-------------+
| l_id | l_curr_time | r_curr_time |
+------+-------------+-------------+
| 1 | 3 | 5 |
| 1 | 3 | 7 |
| 2 | 4 | 5 |
| 2 | 4 | 7 |
| 3 | 6 | 7 |
| 3 | 6 | 8 |
| 4 | 10 | 11 |
| 4 | 10 | 12 |
+------+-------------+-------------+
I have the following solution, which works for the one closest value. But I do not like it very much because it silently relies on GROUP BY keeping the first occurrence from each group:
SELECT l_id, l_curr_time, r_curr_time, time_diff FROM
(
SELECT *, ABS(r_curr_time - l_curr_time) AS time_diff
FROM left_table
JOIN right_table ON 1=1
WHERE r_curr_time >= l_curr_time
ORDER BY l_id ASC, time_diff ASC
) t
GROUP BY l_id;
The output is following:
+------+-------------+-------------+-----------+
| l_id | l_curr_time | r_curr_time | time_diff |
+------+-------------+-------------+-----------+
| 1 | 3 | 5 | 2 |
| 2 | 4 | 5 | 1 |
| 3 | 6 | 7 | 1 |
| 4 | 10 | 11 | 1 |
+------+-------------+-------------+-----------+
4 rows in set (0.00 sec)
As you can see, I am doing JOIN ON 1=1. Is this OK for large data as well (e.g. if both left_table and right_table have 10000 rows, the Cartesian product will have 10^8 rows)? Despite this drawback, I think JOIN ON 1=1 is the only possible solution, because I first need to create all possible combinations from the existing tables and then pick the ones that satisfy the condition. If I'm wrong, please correct me. Thanks.
This question is not trivial. In SQL Server or PostgreSQL it would be very easy because of the row_number() over (...) construct. MySQL before 8.0 does not have it, so you have to work with variables and nested select statements.
To solve this problem you have to combine multiple concepts. I will try to explain them one after the other to come to a solution that fits your question.
1. Let's start easy: How to build a table that contains the information of left_table and right_table?
Use a join. For this particular problem we use a left join, with the join condition that l_curr_time must be less than or equal to r_curr_time. To make the rest easier, we order this table by l_curr_time and r_curr_time. The statement looks like this:
SELECT l_id, l_curr_time, r_curr_time
FROM left_table l
LEFT JOIN right_table r ON l.l_curr_time<=r.r_curr_time
ORDER BY l.l_curr_time, r.r_curr_time;
Now we have a table that is ordered and contains the information we want... but too much of it ;) Because the table is ordered, it would be great if MySQL could select only the first two rows for each value of l_curr_time. This is not possible, so we have to do it ourselves.
2. Medium part: How to number rows?
Use a variable! If you want to number the rows of a table, you can use a MySQL user variable. There are two things to do: first we have to declare and initialize the variable, then we have to increment it. Let's say we have a table with names and we want to know the position of each name when we order them by name:
SELECT name, @num:=@num+1 /* increment */
FROM names t, (SELECT @num:=0) as c /* names is a hypothetical table; the subquery declares and initializes @num */
ORDER BY name ASC;
3. Hard part: How to number subsets of rows depending on the value of one field?
Use a variable to count (see above) and a second variable to hold state. We use the same principle as above, but now we also store the value of the field we depend on in a variable. If that value changes, we reset the counter to 1. Again, this second variable has to be declared and defined. New part: resetting one variable depending on the content of the state variable:
SELECT
l_id,
l_curr_time,
r_curr_time,
@num := IF( /* (re)set num (the counter)... */
@l_curr_time = l_curr_time,
@num:= @num + 1, /* increment if the variable equals the actual l_curr_time field value */
1 /* reset to 1 if the values are not equal */
) as row_num,
@l_curr_time:=l_curr_time as lct /* state variable that holds the l_curr_time value */
FROM ( /* table from Step 1 of the explanation */
SELECT l_id, l_curr_time, r_curr_time
FROM left_table l
LEFT JOIN right_table r ON l.l_curr_time<=r.r_curr_time
ORDER BY l.l_curr_time, r.r_curr_time
) as joinedTable
Now we have a table that holds all the combinations we want (but too many), and all rows are numbered depending on the value of the l_curr_time field. In other words: each subset is numbered from 1 to the number of matching r_curr_time values that are greater than or equal to l_curr_time.
4. Again the easy part: select all the values we want, depending on the row number
This part is easy. Because the table we created in step 3 is ordered and numbered, we can filter by the number (it has to be less than or equal to 2). Furthermore, we select only the columns we're interested in:
SELECT l_id, l_curr_time, r_curr_time, row_num
FROM ( /* table from step 3. */
SELECT
l_id,
l_curr_time,
r_curr_time,
@num := IF(
@l_curr_time = l_curr_time,
@num:= @num + 1,
1
) as row_num,
@l_curr_time:=l_curr_time as lct
FROM (
SELECT l_id, l_curr_time, r_curr_time
FROM left_table l
LEFT JOIN right_table r ON l.l_curr_time<=r.r_curr_time
ORDER BY l.l_curr_time, r.r_curr_time
) as joinedTable
) as numberedJoinedTable,(
SELECT @l_curr_time:='',@num:=0 /* define the state variable and the number variable */
) as counterTable
HAVING row_num<=2; /* the number has to be smaller or equal to 2 */
That's it. This statement returns exactly what you want. You can see this statement in action in this sqlfiddle.
JoshuaK has the right idea. I just think it could be expressed a little more succinctly...
How about:
SELECT n.l_id
, n.l_curr_time
, n.r_curr_time
FROM
( SELECT a.*
, CASE WHEN @prev = l_id THEN @i:=@i+1 ELSE @i:=1 END i
, @prev := l_id prev
FROM
( SELECT l.*
, r.r_curr_time
FROM left_table l
JOIN right_table r
ON r.r_curr_time >= l.l_curr_time
) a
JOIN
( SELECT @prev := null,@i:=0) vars
ORDER
BY l_id,r_curr_time
) n
WHERE i<=2;
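For readers on MySQL 8.0 or later, the same per-group numbering can be sketched with ROW_NUMBER() instead of user variables. This is an illustration under the assumption of MySQL 8.0+, using the left_table and right_table definitions from the question; the aliases rn and ranked are illustrative names.

```sql
-- Sketch only: requires MySQL 8.0+ (window functions).
SELECT l_id, l_curr_time, r_curr_time
FROM (
    SELECT l.l_id, l.l_curr_time, r.r_curr_time,
           -- number the matching right-hand rows per left-hand row, closest first
           ROW_NUMBER() OVER (PARTITION BY l.l_id ORDER BY r.r_curr_time) AS rn
    FROM left_table AS l
    JOIN right_table AS r ON r.r_curr_time >= l.l_curr_time
) AS ranked
WHERE rn <= 2
ORDER BY l_id, r_curr_time;
```

Because this uses an inner join, a left_table row with no matching r_curr_time (l_curr_time = 13 in the sample data) does not appear in the result, which matches the expected output in the question.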

Using ORDER BY and GROUP BY together

My table looks like this (and I'm using MySQL):
m_id | v_id | timestamp
------------------------
6 | 1 | 1333635317
34 | 1 | 1333635323
34 | 1 | 1333635336
6 | 1 | 1333635343
6 | 1 | 1333635349
My target is to take each m_id one time, and order by the highest timestamp.
The result should be:
m_id | v_id | timestamp
------------------------
6 | 1 | 1333635349
34 | 1 | 1333635336
And I wrote this query:
SELECT * FROM table GROUP BY m_id ORDER BY timestamp DESC
But, the results are:
m_id | v_id | timestamp
------------------------
34 | 1 | 1333635323
6 | 1 | 1333635317
I think this happens because it first does the GROUP BY and then orders the results.
Any ideas? Thank you.
One way to do this that correctly uses group by:
select l.*
from table l
inner join (
select
m_id, max(timestamp) as latest
from table
group by m_id
) r
on l.timestamp = r.latest and l.m_id = r.m_id
order by timestamp desc
How this works:
selects the latest timestamp for each distinct m_id in the subquery
only selects rows from table that match a row from the subquery (this operation -- where a join is performed, but no columns are selected from the second table, it's just used as a filter -- is known as a "semijoin" in case you were curious)
orders the rows
If you really don't care about which timestamp you'll get and your v_id is always the same for a given m_id, you can do the following:
select m_id, v_id, max(timestamp) from table
group by m_id, v_id
order by max(timestamp) desc
Now, if the v_id changes for a given m_id then you should do the following
select t1.* from table t1
left join table t2 on t1.m_id = t2.m_id and t1.timestamp < t2.timestamp
where t2.timestamp is null
order by t1.timestamp desc
Here is the simplest solution
select m_id,v_id,max(timestamp) from table group by m_id;
Group by m_id but get max of timestamp for each m_id.
You can try this
SELECT tbl.* FROM (SELECT * FROM table ORDER BY timestamp DESC) as tbl
GROUP BY tbl.m_id
SQL>
SELECT interview.qtrcode QTR, interview.companyname "Company Name", interview.division Division
FROM interview
JOIN jobsdev.employer
ON (interview.companyname = employer.companyname AND employer.zipcode like '100%')
GROUP BY interview.qtrcode, interview.companyname, interview.division
ORDER BY interview.qtrcode;
I felt confused when I tried to understand the question and answers at first. I spent some time reading and I would like to make a summary.
The OP's example is a little bit misleading.
At first I didn't understand why the accepted answer is the accepted answer.. I thought that the OP's request could be simply fulfilled with
select m_id, v_id, max(timestamp) as max_time from table
group by m_id, v_id
order by max_time desc
Then I took a second look at the accepted answer. And I found that actually the OP wants to express that, for a sample table like:
m_id | v_id | timestamp
------------------------
6 | 1 | 11
34 | 2 | 12
34 | 3 | 13
6 | 4 | 14
6 | 5 | 15
he wants to select all columns based only on (group by)m_id and (order by)timestamp.
Then the above sql won't work. If you still don't get it, imagine you have more columns than m_id | v_id | timestamp, e.g. m_id | v_id | timestamp | columnA | columnB | columnC | .... With group by, you can only select the "group by" columns and aggregate functions in the result.
By now, you should understand the accepted answer.
What's more, check row_number function introduced in MySQL 8.0:
https://www.mysqltutorial.org/mysql-window-functions/mysql-row_number-function/
Finding top N rows of every group
It does a similar thing to the accepted answer.
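The ROW_NUMBER() approach mentioned above could look roughly like this. It is a sketch, assuming MySQL 8.0+ and a hypothetical table name my_table standing in for the question's table; rn and ranked are illustrative aliases.

```sql
-- Sketch only: requires MySQL 8.0+ (window functions).
-- my_table is a hypothetical name for the table in the question.
SELECT m_id, v_id, `timestamp`
FROM (
    SELECT m_id, v_id, `timestamp`,
           -- rn = 1 marks the newest row within each m_id
           ROW_NUMBER() OVER (PARTITION BY m_id ORDER BY `timestamp` DESC) AS rn
    FROM my_table
) AS ranked
WHERE rn = 1
ORDER BY `timestamp` DESC;
```

Since whole rows are ranked rather than aggregated, v_id and any further columns come from the same row as the latest timestamp, and the query does not conflict with sql_mode=only_full_group_by.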
Some answers are wrong. My MySQL gives me an error for these:
select m_id,v_id,max(timestamp) from table group by m_id;
@abinash sahoo
SELECT m_id,v_id,MAX(TIMESTAMP) AS TIME
FROM table_name
GROUP BY m_id
@Vikas Garhwal
Error message:
[42000][1055] Expression #2 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'testdb.test_table.v_id' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by
Why make it so complicated? This worked.
SELECT m_id,v_id,MAX(TIMESTAMP) AS TIME
FROM table_name
GROUP BY m_id
You just need to replace desc with asc. Write the query like below. It will return the values in ascending order.
SELECT * FROM table GROUP BY m_id ORDER BY m_id asc;