DB2 - LISTAGG() with DISTINCT clause - doesn't work? - duplicates

My query has a column with a small number of values in it, and I need to display them in a single field for each grouped result set - e.g. if I had an employee in 3 different departments, I'd want to see something like
EMPID  DEPTS                                  SOMENUMBER  SOMEOTHERNUMBER
-----  -------------------------------------  ----------  ---------------
0001   ACCOUNTING, CUST SERVICE, CALL CENTER         100              200
The problem arises when the employee has duplicate departments. I see numerous questions on how to handle this in Oracle and other DBMSs, but nothing specific to DB2. IBM's own documentation at https://www.ibm.com/support/knowledgecenter/SSFMBX/com.ibm.swg.im.dashdb.sql.ref.doc/doc/r0058709.html says:
- "If DISTINCT is specified, duplicate string-expression values are eliminated.", and
- "If DISTINCT is specified for LISTAGG, the sort-key of the ORDER BY specification must match string-expression (SQLSTATE 42822). If string-expression is implicitly cast, the sort-key must explicitly include a corresponding matching cast specification.".
As a simplistic example:
with mylist (field1) as
( values 'A','A','B','C','D','A','C'
)
select listagg (distinct field1, ', ')
within group (order by field1)
from mylist;
The info on the IBM page suggests that this should return 'A, B, C, D' but instead I get 'A, A, A, B, C, C, D'.
I should be able to get around this by having a subquery do some initial rolling up - e.g.
SELECT EMPLOYEE, LISTAGG(DEPARTMENT,', '),
SUM(SOMENUMBER) SOMENUMBER, SUM(SOMEOTHERNUMBER) SOMEOTHERNUMBER
from (
SELECT EMPLOYEE, DEPARTMENT, SUM(SOMENUMBER) SOMENUMBER, SUM(SOMEOTHERNUMBER) SOMEOTHERNUMBER
FROM EMPLOYEES GROUP BY EMPLOYEE, DEPARTMENT
) GROUP BY EMPLOYEE
and in fact I guess that's what I'll do, but the IBM documentation sure suggests the DISTINCT ought to do the trick. What am I missing?

You haven't specified your version, but if you are on DB2 9.7 LUW, as I am, LISTAGG(DISTINCT ...) doesn't work (yet). You have to work around it with XML functions, e.g.:
with mylist (field1) as
( values 'A','A','B','C','D','A','C'
)
select XMLCAST(
XMLQUERY('string-join(distinct-values($x//row), ", ")'
PASSING XMLGROUP(field1 ORDER BY field1) AS "x"
) AS VARCHAR(200))
from mylist
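The subquery idea from the question (deduplicate first, aggregate second) can also be tried out without a DB2 instance. This is only a rough sketch: it uses Python's sqlite3 module as a stand-in, and group_concat as SQLite's approximate counterpart of LISTAGG; the function name distinct_listagg is made up for the illustration.

```python
import sqlite3

def distinct_listagg(values):
    """Return the distinct values joined into one comma-separated string."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mylist (field1 TEXT)")
    conn.executemany("INSERT INTO mylist VALUES (?)", [(v,) for v in values])
    # Deduplicate in a derived table first, then aggregate what is left.
    # group_concat stands in for LISTAGG here (assumption: SQLite, not DB2).
    row = conn.execute(
        """
        SELECT group_concat(field1, ', ')
        FROM (SELECT DISTINCT field1 FROM mylist ORDER BY field1)
        """
    ).fetchone()
    conn.close()
    return row[0]

result = distinct_listagg(["A", "A", "B", "C", "D", "A", "C"])
print(result)
```

Each value appears once in the result. SQLite does not promise the order of concatenation in this form, so treat the ordering as best effort; on DB2 the WITHIN GROUP (ORDER BY ...) clause is what fixes the order.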

Related

what does it mean when we use " group by 1" in an SQL Query

I have come across a query where it is specified by
select concat(21*floor(diff/21), '-', 21*floor(diff/21) + 20) as `range`, count(*) as
`number of users` from new_table group by 1 order by diff;
here what exactly does group by 1 mean?
Assuming you have a Select:
SELECT name FROM employee GROUP BY 1;
No matter what, it will always group by the first column given in the select.
In this case, the column 'name' is grouped.
So if we alter the above statement to:
SELECT department FROM employee GROUP BY 1;
We now group the department, without having to change the '1' in the group by.
EDIT: (as requested by Stewart)
If we have the following data in table 'employee':
-- name --
Walt
Walt
Brian
Barney
A simple select would deliver all rows above, whereas the 'group by 1' would result in one Walt-row:
output with group by:
-- name --
Walt
Brian
Barney
+1 to @FabianBigler for answering first, but I'll add this:
http://dev.mysql.com/doc/refman/5.6/en/select.html says:
Columns selected for output can be referred to in ORDER BY and GROUP BY clauses using column names, column aliases, or column positions. Column positions are integers and begin with 1.
For what it's worth, this is non-standard SQL, so don't expect it to work on other brands of SQL database.
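For anyone who wants to see the positional form in action, here is a small sketch. It runs against SQLite through Python's sqlite3 module (an assumption for convenience; SQLite happens to accept column positions in GROUP BY as well) and reuses the employee data from the other answer. The helper name count_by_first_column is invented for the example.

```python
import sqlite3

def count_by_first_column(names):
    """Count rows per name, grouping by select-list position instead of by name."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (name TEXT)")
    conn.executemany("INSERT INTO employee VALUES (?)", [(n,) for n in names])
    # "GROUP BY 1" refers to the first item in the select list, here "name".
    rows = conn.execute(
        "SELECT name, COUNT(*) FROM employee GROUP BY 1"
    ).fetchall()
    conn.close()
    return dict(rows)

counts = count_by_first_column(["Walt", "Walt", "Brian", "Barney"])
print(counts)
```

The two Walt rows collapse into one group with a count of 2, exactly as they would with GROUP BY name.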

Creating Temp Variables within Queries

I would like to be able to create a temp variable within a query--not a stored proc nor function-- which will not need to be declared and set so that I don't need to pass the query parameters when I call it.
Trying to work toward this:
Select field1,
tempvariable=2+2,
newlycreatedfield=tempvariable*existingfield
From
table
Away from this:
DECLARE @tempvariable
SET @tempvariable = 2+2
Select field1,
newlycreatedfield=@tempvariable*existingfield
From
table
Thank you for your time
I may have overcomplicated the example; more simply, the following gives the Invalid Column Name QID
Select
QID = 1+1
THN = QID + 1
If this is housed in a query, is there a workaround?
You can avoid derived tables and subqueries if you do a "hidden" assignment as part of a complex concat_ws expression.
Since the assignment is part of the expression of the ultimate desired value for the column, as opposed to sitting in its own column, you don't have to worry about whether MySQL will evaluate it in the correct order. Needless to say, if you want to use the temp var in multiple columns, then all bets are off :-/
caveat: I did this in MySQL 5.1.73; things might have changed in later versions
I wrap everything in concat_ws because it coalesces null args to empty strings, whereas concat does not.
I wrap the assignment to the var @stamp in an if so that it is "consumed" instead of becoming an arg to be concatenated. As a side note, I have guaranteed elsewhere that u.status_timestamp is populated when the user record is first created. Then @stamp is used in two places in date_format, both as the date to be formatted and in the nested if to select which format to use. The final concat is an hour range "h-h" which I have guaranteed elsewhere to exist if the c record exists; otherwise its null return is coalesced by the outer concat_ws as mentioned above.
SELECT
concat_ws( '', if( @stamp := ifnull( cs.checkin_stamp, u.status_timestamp ), '', '' ),
date_format( @stamp, if( timestampdiff( day, @stamp, now() )<120, '%a %b %e', "%b %e %Y" )),
concat( ' ', time_format( cs.start, '%l' ), '-', time_format( cs.end, '%l' ))
) AS as_of
FROM dbi_user AS u LEFT JOIN
(SELECT c.u_id, c.checkin_stamp, s.start, s.end FROM dbi_claim AS c LEFT JOIN
dbi_shift AS s ON(c.shift_id=s.id) ORDER BY c.u_id, c.checkin_stamp DESC) AS cs
ON (cs.u_id=u.id) WHERE u.status='active' GROUP BY u.id ;
A final note: while I happen to be using a derived table in this example, it is only because of the requirement to get the latest claim record and its associated shift record for each user. You probably won't need a derived table if a complex join is not involved in the computation of your temp var. This can be demonstrated by going to the first fiddle in @Fabien TheSolution's answer and changing the right hand query to
Select field1, concat_ws( '', if(@tempvariable := 2+2,'','') ,
@tempvariable*existingfield ) as newlycreatedfield
from table1
Likewise the second fiddle (which appears to be broken) would have a right hand side of
SELECT concat_ws( '', if(@QID := 2+2,'',''), @QID + 1) AS THN
You can do this with subqueries:
Select field1, tempvariable,
(tempvariable*existingfield) as newlycreatedfield
from (select t.*, (2+2) as tempvariable
from table t
) t;
Unfortunately, MySQL has a tendency to actually instantiate (i.e. create) a derived table for the subquery. Most other databases are smart enough to avoid this.
You can gamble that the following will work:
Select field1, (@tempvariable := 2+2) as tempvariable,
(@tempvariable*existingfield) as newlycreatedfield
From table t;
This is a gamble, because MySQL does not guarantee that the second select expression is evaluated before the third. It seems to work in practice, but it is not guaranteed.
Why not just:
SET @sum = 4 + 7;
SELECT @sum;
Output:
+------+
| @sum |
+------+
|   11 |
+------+
You can do something like this :
SELECT field1, tv.tempvariable,
(tv.tempvariable*existingfield) AS newlycreatedfield
FROM table1
INNER JOIN (SELECT 2+2 AS tempvariable) AS tv
See SQLFIDDLE : http://www.sqlfiddle.com/#!2/8b0724/8/0
And to refer at your simplified example :
SELECT var.QID,
(var.QID + 1) AS THN
FROM (SELECT 1+1 as QID) AS var
See SQLFIDDLE : http://www.sqlfiddle.com/#!2/d41d8/19140/0
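If the fiddles are unavailable, the derived-table approach can be checked locally. The sketch below is an assumption-laden stand-in: it uses SQLite via Python's sqlite3 module rather than MySQL, spells the join as CROSS JOIN, and invents the sample rows and the helper name with_temp_value.

```python
import sqlite3

def with_temp_value(rows):
    """Compute 2+2 once in a one-row derived table and reuse it per row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (field1 TEXT, existingfield INTEGER)")
    conn.executemany("INSERT INTO table1 VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT field1, tv.tempvariable,
               tv.tempvariable * existingfield AS newlycreatedfield
        FROM table1
        CROSS JOIN (SELECT 2 + 2 AS tempvariable) AS tv
        ORDER BY field1
        """
    ).fetchall()
    conn.close()
    return result

computed = with_temp_value([("a", 1), ("b", 5)])
print(computed)
```

Because the derived table has exactly one row, the join does not multiply the rows of table1; it only makes tempvariable visible to every select expression, with no reliance on evaluation order.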

Page-navigation using stored procedure (can not bind a composite identifier)

I am using a stored procedure to make page-navigation when I view a list of stores. I have only one table Store with columns Name & S_Id.
And here is my query :
SELECT Stores.Name
FROM
(
SELECT ROW_NUMBER() OVER (ORDER BY Stores.S_Id) AS rownum ,
Stores.Name
FROM Stores
)AS ordered
WHERE ordered.rownum BETWEEN [someValue] AND [someValue]
But when I try to save my procedure I get an error:
can not bind a composite identifier Stores.Name
I have seen a lot of topics but can't find what's wrong. If I were doing it with LINQ, I would try something like this:
(select name from Stores order by S_Id).Skip(n).Take(m) .
Your subquery defines a new name - ordered - for your data - so you need to use that new name instead of stores:
SELECT
ordered.Name -- you're selecting from the subquery, which is called "ordered", so use that name
FROM
(SELECT
ROW_NUMBER() OVER (ORDER BY Stores.S_Id) AS rownum,
Stores.Name
FROM Stores
) AS ordered
WHERE
ordered.rownum BETWEEN [someValue] AND [someValue]
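To try the corrected query shape outside SQL Server, here is a rough sketch using Python's sqlite3 module. The assumptions: SQLite 3.25 or later (needed for window functions), ten made-up store rows, and a hypothetical helper fetch_page that takes the two row bounds as parameters in place of [someValue].

```python
import sqlite3

def fetch_page(conn, first_row, last_row):
    """Return the store names whose row number falls in [first_row, last_row]."""
    cursor = conn.execute(
        """
        SELECT ordered.Name
        FROM (
            SELECT ROW_NUMBER() OVER (ORDER BY S_Id) AS rownum, Name
            FROM Stores
        ) AS ordered
        WHERE ordered.rownum BETWEEN ? AND ?
        ORDER BY ordered.rownum
        """,
        (first_row, last_row),
    )
    return [name for (name,) in cursor]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Stores (S_Id INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany(
    "INSERT INTO Stores VALUES (?, ?)",
    [(i, f"Store {i}") for i in range(1, 11)],
)
page = fetch_page(conn, 4, 6)
print(page)
```

Note that the outer query only ever refers to ordered; Stores is visible inside the subquery alone. The outer ORDER BY is added so the page comes back in a defined order.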

PostgreSQL equivalent for MySQL GROUP BY

I need to find duplicates in a table. In MySQL I simply write:
SELECT *,count(id) count FROM `MY_TABLE`
GROUP BY SOME_COLUMN ORDER BY count DESC
This query nicely:
Finds duplicates based on SOME_COLUMN, giving its repetition count.
Sorts in desc order of repetition, which is useful to quickly scan major dups.
Chooses a random value for all remaining columns, giving me an idea of values in those columns.
Similar query in Postgres greets me with an error:
column "MY_TABLE.SOME_COLUMN" must appear in the GROUP BY clause or be
used in an aggregate function
What is the Postgres equivalent of this query?
PS: I know that MySQL behaviour deviates from SQL standards.
Back-ticks are a non-standard MySQL thing. Use the canonical double quotes to quote identifiers (possible in MySQL, too). That is, if your table in fact is named "MY_TABLE" (all upper case). If you (more wisely) named it my_table (all lower case), then you can remove the double quotes or use lower case.
Also, I use ct instead of count as alias, because it is bad practice to use function names as identifiers.
Simple case
This would work with PostgreSQL 9.1:
SELECT *, count(id) ct
FROM my_table
GROUP BY primary_key_column(s)
ORDER BY ct DESC;
It requires primary key column(s) in the GROUP BY clause. The results are identical to a MySQL query, but ct would always be 1 (or 0 if id IS NULL) - useless to find duplicates.
Group by other than primary key columns
If you want to group by other column(s), things get more complicated. This query mimics the behavior of your MySQL query - and you can use *.
SELECT DISTINCT ON (1, some_column)
count(*) OVER (PARTITION BY some_column) AS ct
,*
FROM my_table
ORDER BY 1 DESC, some_column, id, col1;
This works because DISTINCT ON (PostgreSQL-specific), like DISTINCT (SQL standard), is applied after the window function count(*) OVER (...). Window functions (with the OVER clause) require PostgreSQL 8.4 or later and are not available in MySQL.
Works with any table, regardless of primary or unique constraints.
The 1 in DISTINCT ON and ORDER BY is just shorthand to refer to the ordinal number of the item in the SELECT list.
SQL Fiddle to demonstrate both side by side.
More details in this closely related answer:
Select first row in each GROUP BY group?
count(*) vs. count(id)
If you are looking for duplicates, you are better off with count(*) than with count(id). There is a subtle difference if id can be NULL, because NULL values are not counted - while count(*) counts all rows. If id is defined NOT NULL, results are the same, but count(*) is generally more appropriate (and slightly faster, too).
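The difference is easy to see with a NULL in the column. A minimal sketch, assuming SQLite through Python's sqlite3 module as a stand-in (the NULL-skipping behaviour of count(column) is the same in PostgreSQL); the helper name count_both is made up.

```python
import sqlite3

def count_both(ids):
    """Return (count(*), count(id)) for a one-column table holding ids."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE my_table (id INTEGER)")
    conn.executemany("INSERT INTO my_table VALUES (?)", [(i,) for i in ids])
    # count(*) counts rows; count(id) counts only rows where id is not NULL.
    row = conn.execute("SELECT count(*), count(id) FROM my_table").fetchone()
    conn.close()
    return row

all_rows, non_null_ids = count_both([1, 2, None])
print(all_rows, non_null_ids)
```

With one NULL among three rows, count(*) reports 3 while count(id) reports 2.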
Here's another approach, uses DISTINCT ON:
select
distinct on(ct, some_column)
*,
count(id) over(PARTITION BY some_column) as ct
from my_table x
order by ct desc, some_column, id
Data source:
CREATE TABLE my_table (some_column int, id int, col1 int);
INSERT INTO my_table VALUES
(1, 3, 4)
,(2, 4, 1)
,(2, 5, 1)
,(3, 6, 4)
,(3, 7, 3)
,(4, 8, 3)
,(4, 9, 4)
,(5, 10, 1)
,(5, 11, 2)
,(5, 11, 3);
Output:
SOME_COLUMN  ID  COL1  CT
          5  10     1   3
          2   4     1   2
          3   6     4   2
          4   8     3   2
          1   3     4   1
Live test: http://www.sqlfiddle.com/#!1/e2509/1
DISTINCT ON documentation: http://www.postgresonline.com/journal/archives/4-Using-Distinct-ON-to-return-newest-order-for-each-customer.html
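For databases without DISTINCT ON, roughly the same result can be had by numbering the rows inside each group and keeping the first one. This is a sketch under assumptions, not part of the answer above: it runs on SQLite 3.25+ via Python's sqlite3 module, uses row_number() in place of DISTINCT ON, and reuses the sample data shown above. The helper name first_row_per_group is invented.

```python
import sqlite3

ROWS = [
    (1, 3, 4), (2, 4, 1), (2, 5, 1), (3, 6, 4), (3, 7, 3),
    (4, 8, 3), (4, 9, 4), (5, 10, 1), (5, 11, 2), (5, 11, 3),
]

def first_row_per_group(rows):
    """One row per some_column value plus its repetition count, biggest first."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE my_table (some_column INT, id INT, col1 INT)")
    conn.executemany("INSERT INTO my_table VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT some_column, id, col1, ct
        FROM (
            SELECT some_column, id, col1,
                   count(*) OVER (PARTITION BY some_column) AS ct,
                   row_number() OVER (PARTITION BY some_column
                                      ORDER BY id, col1) AS rn
            FROM my_table
        ) AS numbered
        WHERE rn = 1               -- keep the first row of each group
        ORDER BY ct DESC, some_column
        """
    ).fetchall()
    conn.close()
    return result

dups = first_row_per_group(ROWS)
for row in dups:
    print(row)
```

On the sample data this yields the same five rows as the output listed above, with the group of three (some_column = 5) first.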
MySQL allows GROUP BY to omit non-aggregated selected columns from the GROUP BY list, which it executes by returning the first row found for each unique combination of the grouped columns. This is non-standard SQL behaviour.
PostgreSQL, on the other hand, follows the SQL standard here.
There is no directly equivalent query in PostgreSQL.
Here is a self-joined CTE, which allows you to use select *. key0 is the intended unique key, {key1,key2} are the additional key elements needed to address the currently non-unique rows. Use at your own risk, YMMV.
WITH zcte AS (
SELECT DISTINCT tt.key0
, MIN(tt.key1) AS key1
, MIN(tt.key2) AS key2
, COUNT(*) AS cnt
FROM ztable tt
GROUP BY tt.key0
HAVING COUNT(*) > 1
)
SELECT zt.*
, zc.cnt AS cnt
FROM ztable zt
JOIN zcte zc ON zc.key0 = zt.key0 AND zc.key1 = zt.key1 AND zc.key2 = zt.key2
ORDER BY zt.key0, zt.key1,zt.key2
;
BTW: to get the intended behaviour for the OP, the HAVING COUNT(*) > 1 clause should be omitted.

Count on DISTINCT of several fields work only on MySQL?

I need a query that works without any changes on three different database servers: MySQL, MSSQL and PostgreSQL.
In this query I have to calculate a column with the following expression, which works correctly on MySQL:
COUNT(DISTINCT field_char,field_int,field_date) AS costumernum
The fields in the distinct are of different type :
field_char = character
field_int = integer
field_date = datetime
The expression is inside the select list of a parent query, so if I try to achieve the result with a subquery approach, I run into this situation:
SELECT t0.description,t0.depnum,
(select count(*) from (
select distinct f1, f2, f3 from salestable t1
where t1.depnum = t0.depnum
) a) AS numitems
FROM salestable t0
I get an error with this query. How can I reference the value from the parent query?
The expression works correctly on MySQL, but I get an error when I try to execute it on SQL Server or PostgreSQL (the problem is that the count function doesn't accept 3 arguments of different types on MSSQL/PostgreSQL). Is there a way to achieve the same result with an expression that works on each of these database servers (SQL Server, MySQL, PostgreSQL)?
A general way to do this on any platform is as follows:
select count(*) from (
select distinct f1, f2, f3 from table
) a
Edit for new info:
What if you try joining to the distinct list (including the dept) and then doing the count? I created some test data and this seems to work. Make sure the COUNT is on one of the t1 columns - otherwise it will mistakenly return 1 instead of 0 when there are no corresponding entries in t1.
SELECT t0.description, t0.depnum, count(t1.depnum) as numitems
FROM salestable t0
LEFT JOIN (select distinct f1,f2,f3,depnum from salestable) t1
ON t0.depnum = t1.depnum
GROUP BY
t0.description, t0.depnum
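If one row per department is enough, another option is to group the deduplicated rows directly, which uses only DISTINCT, a derived table and GROUP BY and so should port across all three servers. A rough sketch, with assumptions stated: it runs on SQLite via Python's sqlite3 module as a stand-in, the sample rows and column types are invented, and the helper name distinct_combos_per_department is hypothetical.

```python
import sqlite3

def distinct_combos_per_department(rows):
    """Count distinct (f1, f2, f3) combinations for each depnum."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE salestable (depnum INTEGER, f1 TEXT, f2 INTEGER, f3 TEXT)"
    )
    conn.executemany("INSERT INTO salestable VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT a.depnum, COUNT(*) AS numitems
        FROM (SELECT DISTINCT depnum, f1, f2, f3 FROM salestable) AS a
        GROUP BY a.depnum
        ORDER BY a.depnum
        """
    ).fetchall()
    conn.close()
    return result

sample = [
    (1, "x", 1, "2020-01-01"),
    (1, "x", 1, "2020-01-01"),  # exact duplicate, counted once
    (1, "y", 1, "2020-01-01"),
    (2, "x", 2, "2020-01-02"),
]
per_dept = distinct_combos_per_department(sample)
print(per_dept)
```

Including depnum in the DISTINCT list keeps the department available for the outer GROUP BY, so no correlated reference to the parent query is needed.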
How about concatenating?
COUNT(DISTINCT field_char || '.' ||
cast(field_int as varchar) || '.' ||
cast(field_date as varchar)) AS costumernum
Warning: your concatenation operator may vary with RDBMS flavor.
Update
Apparently, the portability of the concatenation operator is a question in itself:
String concatenation operator in Oracle, Postgres and SQL Server
I tried to help you with the distinct issue.