HIVE Sum Over query - partitioning

I'm trying to convert a query from Teradata to Hive QL (HDP) and have struggled to find examples.
Teradata (my functional end goal): I want a count of the records in the table, then a count for each growth_type_id value, and ultimately the percentage of the total that each group represents.
select trim(growth_type_id) AS VAL, COUNT(1) AS cnt, SUM(cnt) over () as GRP_CNT,CNT/(GRP_CNT* 1.0000) AS perc
from acdw_apex_account_strategy
qualify perc > .01 group by val
Note: running HDP-2.4.3.0-227

select val
,cnt
,grp_cnt
,cnt/(grp_cnt* 1.0000) as perc
from (select trim(growth_type_id) as val
,count(*) as cnt
,sum(count(*)) over () as grp_cnt
from acdw_apex_account_strategy
group by trim(growth_type_id)
) t
where cnt/grp_cnt > 0.01
;
QUALIFY is unique to Teradata.
Referencing aliases anywhere in the query is unique to Teradata.
Grouping by column positions depends on a parameter: hive.groupby.orderby.position.alias
Grouping by aliases is not supported: https://issues.apache.org/jira/browse/HIVE-1683
Hive doesn't use integer arithmetic for division (e.g. 7/4 = 1.75 and not 1 as in Teradata).
Decimal notation without preceding digit(s) (e.g. .01) is not valid.
P.S.
You are using QUALIFY before the GROUP BY. Although Teradata syntax is flexible and only requires that the SELECT/WITH clause comes first, I strongly recommend keeping the standard order of clauses:
WITH - SELECT - FROM - WHERE - GROUP BY - HAVING - ORDER BY
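As a rough, runnable sketch of the rewritten query's shape (derived table first, filter on the outside), here is a Python script that uses the standard-library sqlite3 module as a stand-in for Hive. The sample rows are made up, and SQLite 3.25 or later is assumed for window function support. To keep it simple the grouping and the window total are split into two nested levels; note also that SQLite, unlike Hive, does integer division, so the filter multiplies by 1.0 as well.

```python
import sqlite3

# In-memory SQLite stands in for Hive; the table and column names follow the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE acdw_apex_account_strategy (growth_type_id TEXT)")
conn.executemany(
    "INSERT INTO acdw_apex_account_strategy VALUES (?)",
    [(" A",), ("A ",), ("A",), ("B",)],  # made-up sample rows
)

share_rows = conn.execute(
    """
    SELECT val, cnt, grp_cnt, cnt / (grp_cnt * 1.0) AS perc
    FROM (SELECT val, cnt, SUM(cnt) OVER () AS grp_cnt  -- total across all groups
          FROM (SELECT TRIM(growth_type_id) AS val, COUNT(*) AS cnt
                FROM acdw_apex_account_strategy
                GROUP BY TRIM(growth_type_id)) g
         ) t
    WHERE cnt / (grp_cnt * 1.0) > 0.01                  -- plays the role of QUALIFY
    ORDER BY val
    """
).fetchall()

print(share_rows)
```

The outer WHERE can see cnt and grp_cnt only because they are ordinary columns of the derived table by the time it runs.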


Mysql: Sort an aggregate ascending with zeros last

I'm attempting to sort an aggregate column, which contains some zero values. I need the zero values to be last.
For non-aggregate columns I can do something like this (simplified example query):
SELECT age FROM books
ORDER BY
age = 0,
age ASC
However, for aggregate columns I'm getting an error as the column doesn't exist:
SELECT avg(age) as avg_age FROM books
GROUP BY book.type
ORDER BY
avg_age = 0,
avg_age ASC
The error is:
SQLSTATE[42S22]: Column not found: 1247 Reference 'avg_age' not supported (reference to group function)
I totally appreciate why this is happening, but I wasn't able to find a workaround, any tips?
There seems to be an (old) related bug report:
[21 Mar 2016 9:22] Jiří Kavalík
Description: When using alias to aggregated column in ORDER BY only
plain alias is allowed, using it in any expression returns error.
http://sqlfiddle.com/#!9/e87bb/7
Workarounds:
- select the expression and use its alias
- use a derived table and order the outer one
How to repeat:
create table t(a int);
-- these work
select sum(a) x from t group by a order by x;
select sum(a) x from t group by a order by sum(a);
select sum(a) x from t group by a order by -sum(a);
-- this one wrongly gives "Reference 'x' not supported (reference to group function)"
select sum(a) x from t group by a order by -x;
source
You would have to repeat the aggregate expression in the ORDER BY, as shown here. This is also better because the query is then valid ANSI/ISO standard SQL, which makes it more likely to be portable between database vendors.
SELECT
avg(books.age) as avg_age
FROM books
GROUP BY books.type
ORDER BY
avg(books.age) = 0
, avg(books.age) ASC
See demo. This bug is fixed in MySQL 8.0; see demo.
Try repeating the aggregate expression instead of using its alias:
SELECT avg(age) as avg_age
FROM books
GROUP BY books.type
ORDER BY avg(age) = 0, avg(age) ASC
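To see the "zeros last" ordering in action, here is a small sketch that runs the same ORDER BY pattern through Python's sqlite3 module. SQLite is only a stand-in for MySQL here, and the sample rows are invented; the point is that the boolean expression sorts first (false before true) and the average breaks ties.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (type TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [("a", 0), ("a", 0), ("b", 10), ("b", 20), ("c", 5)],  # made-up rows
)

# The aggregate is repeated in ORDER BY instead of referencing the alias.
avg_ages = [
    row[0]
    for row in conn.execute(
        """
        SELECT AVG(books.age) AS avg_age
        FROM books
        GROUP BY books.type
        ORDER BY AVG(books.age) = 0, AVG(books.age) ASC
        """
    )
]

print(avg_ages)  # [5.0, 15.0, 0.0]
```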

MySQL return summed values and a virtual column as (count - sum)

I have a table as follows:
log (log_id, log_success (bool), log_created)
I would like to SELECT and return 3 columns: date, success and no_success, where the last one does not exist in the table, and aggregate them by day.
I have created this query:
SELECT
log_created as 'date'
COUNT(*) AS 'count',
SUM(log_success) AS 'success'
SUM('count' - 'success') AS 'no_success'
FROM send_log
GROUP BY DATE_FORMAT(log_created, '%Y-%m-%d');
Would I be able to achieve it with this query? Is my syntax correct?
Thanks.
You can't reuse an alias defined in the select within the same select clause. The reason for this is that it might not even have been defined when you go to access it. But you can easily repeat the logic:
SELECT
log_created AS date,
SUM(log_success) AS success,
COUNT(*) - SUM(log_success) AS no_success
FROM send_log
GROUP BY
log_created;
I don't know why you are calling DATE_FORMAT in the group by clause of your query. DATE_FORMAT is usually a presentation layer function, which you call because you want to view a date formatted a certain way. Since it appears that log_created is already a date, there is no need to call DATE_FORMAT on it when aggregating. You should not need it in the select clause either, because the default format for a MySQL date is already Y-m-d.
You must select DATE_FORMAT(log_created, '%Y-%m-%d') if you want to group by this.
Also you can get the no_success counter with SUM(abs(log_success - 1))
SELECT
DATE_FORMAT(log_created, '%Y-%m-%d') date,
SUM(log_success) log_success,
SUM(abs(log_success - 1)) no_success
FROM send_log
GROUP BY date;
See the demo
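A minimal sketch of the per-day counts, run through Python's sqlite3 module as a stand-in for MySQL. The sample rows are invented, and SQLite's DATE() is used where the MySQL answers use DATE_FORMAT or a plain date column; the COUNT(*) - SUM(log_success) expression is the part that carries over.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE send_log ("
    "log_id INTEGER PRIMARY KEY, log_success INTEGER, log_created TEXT)"
)
conn.executemany(
    "INSERT INTO send_log (log_success, log_created) VALUES (?, ?)",
    [
        (1, "2020-01-01 10:00:00"),  # made-up rows
        (0, "2020-01-01 11:00:00"),
        (0, "2020-01-01 12:00:00"),
        (1, "2020-01-02 09:00:00"),
    ],
)

# Failures per day = all rows that day minus the successful ones.
daily = conn.execute(
    """
    SELECT DATE(log_created) AS day,
           SUM(log_success) AS success,
           COUNT(*) - SUM(log_success) AS no_success
    FROM send_log
    GROUP BY DATE(log_created)
    ORDER BY day
    """
).fetchall()

print(daily)  # [('2020-01-01', 1, 2), ('2020-01-02', 1, 0)]
```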

How to count when value crosses the average

I am trying to write a MySQL query that would count the number of times a value crosses a constant. The end result is that we are trying to determine the relative 'noise' of the value via the amplitude and the frequency of the value. MIN() and MAX() provide the amplitude. Count() gives the number of samples that fit the criteria, but it doesn't show how stable that value is. We are currently using MySQL 5.7 but we will be moving to MySQL 8.0, which provides the windowing features. Something like
Select Count(Value) over (order by logtime ROWS 1 Preceding <123 AND 1 Following > 123) WHERE logtime BETWEEN...;
Thank you for any help you can provide.
SELECT Count(Value) WHERE Value > 123 AND logtime BETWEEN...;
SELECT Count(Value) WHERE Value < 123 AND logtime BETWEEN...;
Window functions are not available in MySQL versions before 8.0.
With MySQL 5.7, we can emulate some window functions by using user-defined variables in a carefully crafted query. The MySQL Reference Manual gives explicit warning about using user-defined variables in a context like this. We are relying on behavior that is not guaranteed.
But as an example of the pattern I would use to achieve the specified result:
SELECT SUM(c.crossed_avg) AS count_crossed_avg
FROM (
SELECT IF( ( @prval > a.avg_ AND t.value < a.avg_ ) OR
( @prval < a.avg_ AND t.value > a.avg_ )
,1,0) AS crossed_avg
, @prval := t.value AS value_
FROM mytable t
CROSS
JOIN ( SELECT 123 AS avg_ ) a
CROSS
JOIN ( SELECT @prval := NULL ) i
WHERE ...
ORDER BY t.logtime
) c
To unpack this, focus first on the inline view query; that is, ignore the SELECT SUM() wrapper query, and run just the inline view query.
We order the rows by logtime so that we can process the rows in order.
We compare the value on the current row to the value from the previous row. If one is above average and the other is below average, then we return a 1, else we return 0.
We save the current value into the user-defined variable for comparison with the next row. (Note: the order of operations is important; we are depending on MySQL to do that assignment after the evaluation of the IF() function.)
The example query doesn't address the edge case when a row value is exactly equal to the average, e.g. a sequence of values 124.4, 123.0, 122.2. (We might want to consider changing the comparisons so that one of them includes equality, e.g. < and >=.)
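The row-by-row logic described above can be sketched outside the database as a plain Python function. This is an illustration of the comparison rule only, not a replacement for the query; the function name and sample values are made up, and the input is assumed to be already ordered by logtime.

```python
from typing import Iterable, Optional


def count_crossings(values: Iterable[float], threshold: float) -> int:
    """Count how often consecutive values fall on opposite sides of threshold."""
    crossings = 0
    previous: Optional[float] = None  # plays the role of the user-defined variable
    for value in values:
        if previous is not None and (
            (previous > threshold and value < threshold)
            or (previous < threshold and value > threshold)
        ):
            crossings += 1
        previous = value  # assigned only after the comparison, as in the query
    return crossings


samples = [120, 125, 124, 122, 123, 126]  # hypothetical readings ordered by logtime
crossing_count = count_crossings(samples, 123)
print(crossing_count)  # 2
```

With strict comparisons the sequence 122, 123, 126 is not counted, because the value that equals the threshold sits between the two sides. That is the edge case mentioned above.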

Using an AS value later on in the query

Consider the following example query:
SELECT foo.bar,
DATEDIFF(
# Some more advanced logic, such as IF(,,), which shouldn't be copy pasted
) as bazValue
FROM foo
WHERE bazValue >= CURDATE() # <-- This doesn't work
How can I make the bazValue available later on in the query? I'd prefer this, since I believe that it's enough to maintain the code in one place if possible.
There are a couple of ways around this problem that you can use in MySQL:
By using an inline view (this should work in most other versions of SQL, too):
select * from
(SELECT foo.bar,
DATEDIFF(
# Some more advanced logic, such as IF(,,), which shouldn't be copy pasted
) as bazValue
FROM foo) buz
WHERE bazValue >= CURDATE()
By using a HAVING clause (using column aliases in HAVING clauses is specific to MySQL):
SELECT foo.bar,
DATEDIFF(
# Some more advanced logic, such as IF(,,), which shouldn't be copy pasted
) as bazValue
FROM foo
HAVING bazValue >= CURDATE()
As documented under Problems with Column Aliases:
Standard SQL disallows references to column aliases in a WHERE clause. This restriction is imposed because when the WHERE clause is evaluated, the column value may not yet have been determined. For example, the following query is illegal:
SELECT id, COUNT(*) AS cnt FROM tbl_name
WHERE cnt > 0 GROUP BY id;
The WHERE clause determines which rows should be included in the GROUP BY clause, but it refers to the alias of a column value that is not known until after the rows have been selected, and grouped by the GROUP BY.
You can however reuse the aliased expression, and if it uses deterministic functions the query optimiser will ensure that cached results are reused:
SELECT foo.bar,
DATEDIFF(
-- your arguments
) as bazValue
FROM foo
WHERE DATEDIFF(
-- your arguments
) >= CURDATE()
Alternatively, you can move the filter into a HAVING clause (where aliased columns will already have been calculated and are therefore available) - but performance will suffer as indexes cannot be used and the filter will not be applied until after results have been compiled.
As MySQL (before 8.0) doesn't support CTEs, consider using an inline view:
SELECT foo.bar
FROM foo,
(SELECT DATEDIFF(
# Some more advanced logic, such as IF(,,), which shouldn't be copy pasted
) as bazValue
) AS iv
WHERE iv.bazValue >= CURDATE()
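For a concrete feel of the inline-view workaround, here is a small sketch using Python's sqlite3 module as a stand-in for MySQL. The start_day and end_day columns and the subtraction are hypothetical placeholders for the "more advanced logic" in the question; the point is that the expression is written once and the outer query filters on its alias.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (bar TEXT, start_day INTEGER, end_day INTEGER)")
conn.executemany(
    "INSERT INTO foo VALUES (?, ?, ?)",
    [("x", 1, 2), ("y", 1, 6), ("z", 3, 6)],  # made-up rows
)

# The expression lives only in the inner query; the outer WHERE reuses the alias.
filtered = conn.execute(
    """
    SELECT bar, bazValue
    FROM (SELECT foo.bar, foo.end_day - foo.start_day AS bazValue
          FROM foo) buz
    WHERE bazValue >= 3
    ORDER BY bar
    """
).fetchall()

print(filtered)  # [('y', 5), ('z', 3)]
```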

MySQL GROUP BY doesn't work when migrated to SQL Server 2012

I'm moving my Delphi application from MySQL to SQL Server 2012. In MySQL I had this query:
SELECT *,(XS+S+M+L+XL+XXL+[1Size]+Custom) as Total FROM StockData
GROUP BY StyleNr,Customer,Color
ORDER BY StyleNr,Customer,Color
And it worked perfectly. But in Microsoft SQL Server 2012 this query says
Msg 8120, Level 16, State 1, Line 1
Column 'StockData.ID' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
If I change my query to:
SELECT *,([XS]+[S]+[M]+[L]+[XL]+[XXL]+[1Size]+[Custom]) total
FROM [dbo].[stockdata]
GROUP BY ID,StyleNr,Customer,Color
ORDER BY StyleNr,Customer,Color
Then I get this error:
Msg 8120, Level 16, State 1, Line 1
Column 'dbo.stockdata.XS' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
Any ideas?
Here is the table's design view:
SQL Server is working as expected. You must include all items in your SELECT list in either a GROUP BY or in an aggregate function:
SELECT *,(XS+S+M+L+XL+XXL+[1Size]+Custom) as Total
FROM StockData
-- GROUP BY ID,StyleNr,Customer,Color, XS,S,M,L,XL,XXL,[1Size],Custom
ORDER BY StyleNr,Customer,Color
Or you might be able to use:
SELECT StyleNr,Customer,Color, SUM(XS+S+M+L+XL+XXL+[1Size]+Custom) as Total
FROM StockData
GROUP BY StyleNr,Customer,Color
ORDER BY StyleNr,Customer,Color;
All columns in an aggregate query must either be used by an aggregate function or appear in the GROUP BY. Try selecting only the columns you require rather than *, i.e. select stylenr, customer, color, ([...] ) as Total from.
This is the SQL standard way of dealing with aggregates; you'd get a similar error in Oracle.
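A short sketch of the "select only grouped columns plus aggregates" form, run through Python's sqlite3 module. SQLite stands in for SQL Server here and would not raise Msg 8120 itself, the table is cut down to three size columns, and the rows are invented; the query shape is what matters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE StockData ("
    "ID INTEGER PRIMARY KEY, StyleNr TEXT, Customer TEXT, Color TEXT, "
    "XS INTEGER, S INTEGER, M INTEGER)"  # reduced set of size columns
)
conn.executemany(
    "INSERT INTO StockData (StyleNr, Customer, Color, XS, S, M) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("100", "Acme", "Red", 1, 2, 3),  # made-up rows
        ("100", "Acme", "Red", 0, 1, 1),
        ("200", "Acme", "Blue", 4, 0, 0),
    ],
)

# Every selected column is either in GROUP BY or inside an aggregate.
totals = conn.execute(
    """
    SELECT StyleNr, Customer, Color, SUM(XS + S + M) AS Total
    FROM StockData
    GROUP BY StyleNr, Customer, Color
    ORDER BY StyleNr, Customer, Color
    """
).fetchall()

print(totals)  # [('100', 'Acme', 'Red', 8), ('200', 'Acme', 'Blue', 4)]
```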
You can also use this approach:
with OrdinalOnGroup as
(
SELECT
Ordinal = rank() over(partition by StyleNr, Customer, Color order by id)
, *, (XS+S+M+L+XL+XXL+[1Size]+Custom) as Total
FROM StockData
)
select *
from OrdinalOnGroup
where Ordinal = 1;
PARTITION BY denotes the grouping of related rows; this is called windowing.