Pivot data in snowflake like pandas pivot - mysql

Lately, I have been trying to pivot a table in Snowflake so that I can replicate a transformation that is currently done in pandas, as follows.
I have a dataframe like the below:
I have been able to convert this into the following format:
Using code below:
dd = pd.pivot(df[['customerid', 'filter_', 'sum', 'count', 'max']], index='customerid', columns='filter_')
dd = dd.set_axis(dd.columns.map('_'.join), axis=1, inplace=False).reset_index()
I have been trying to do this in snowflake but am unable to get the same format. Here's what I have tried:
with temp as (
SELECT $1 as customerid, $2 as perfiosid, $3 as filter_, $4 as sum_, $5 as count_, $6 as max_
FROM
VALUES ('a', 'b', 'c', 10, 100, 1000),
('a', 'b', 'c1', 9, 900, 9000),
('a', 'b', 'c2', 80, 800, 8000),
('x', 'b', 'c', 10, 100, 1000),
('x', 'b', 'c1', 9, 900, 9000),
('x', 'b', 'c2', 80, 800, 8000))
,
cte as (
select *, 'SUM_' as idx
from temp pivot ( max(sum_) for filter_ in ('c', 'c1', 'c2'))
union all
select *, 'COUNT_' as idx
from temp pivot ( max(count_) for filter_ in ('c', 'c1', 'c2'))
union all
select *, 'MAX_' as idx
from temp pivot ( max(max_) for filter_ in ('c', 'c1', 'c2'))
order by customerid, perfiosid
)
-- select * from cte;
select customerid, perfiosid, idx, max("'c'") as c, max("'c1'") as c1, max("'c2'") as c2
from cte
group by 1, 2, 3
order by 1, 2, 3
The output I get from this is:
Note: I have 3k fixed filters per customerid and 18 columns like sum, count, max, min, stddev, etc., so the final output must have 54k columns for each customerid (3k x 18). How can I achieve this while staying within Snowflake's 1 MB limit on statement size?

Using conditional aggregation:
with temp as (
SELECT $1 as customerid, $2 as perfiosid, $3 as filter_, $4 as sum_, $5 as count_, $6 as max_
FROM
VALUES ('a', 'b', 'c', 10, 100, 1000),
('a', 'b', 'c1', 9, 900, 9000),
('a', 'b', 'c2', 80, 800, 8000),
('x', 'b', 'c', 10, 100, 1000),
('x', 'b', 'c1', 9, 900, 9000),
('x', 'b', 'c2', 80, 800, 8000)
)
SELECT customerid,
SUM(CASE WHEN FILTER_ = 'c' THEN SUM_ END) AS SUM_C,
SUM(CASE WHEN FILTER_ = 'c1' THEN SUM_ END) AS SUM_C1,
SUM(CASE WHEN FILTER_ = 'c2' THEN SUM_ END) AS SUM_C2,
SUM(CASE WHEN FILTER_ = 'c' THEN COUNT_ END) AS COUNT_C,
SUM(CASE WHEN FILTER_ = 'c1' THEN COUNT_ END) AS COUNT_C1,
SUM(CASE WHEN FILTER_ = 'c2' THEN COUNT_ END) AS COUNT_C2,
MAX(CASE WHEN FILTER_ = 'c' THEN MAX_ END) AS MAX_C,
MAX(CASE WHEN FILTER_ = 'c1' THEN MAX_ END) AS MAX_C1,
MAX(CASE WHEN FILTER_ = 'c2' THEN MAX_ END) AS MAX_C2
FROM temp
GROUP BY customerid;
Output:
To stay within the 1 MB query limit, the output could be split and materialized in temporary tables first, for example:
CREATE TEMPORARY TABLE t_SUM
AS
SELECT customer_id,
SUM(...)
FROM tab
GROUP BY customer_id;
CREATE TEMPORARY TABLE t_COUNT
AS
SELECT customer_id,
SUM(...)
FROM tab
GROUP BY customer_id;
CREATE TEMPORARY TABLE t_MAX
AS
SELECT customer_id,
MAX(...)
FROM tab
GROUP BY customer_id;
Combined query:
SELECT *
FROM t_SUM AS s
JOIN t_COUNT AS c
ON s.customer_id = c.customer_id
JOIN t_MAX AS m
ON m.customer_id = c.customer_id
-- ...
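With thousands of filters, the SELECT list would normally be generated rather than typed by hand. Here is a minimal sketch of that idea, assuming Python with the standard-library sqlite3 module as a stand-in for Snowflake. The table name metrics and the helper build_pivot_sql are hypothetical, and filter values are interpolated directly into the SQL, so they are assumed to be trusted:

```python
import sqlite3

# Sample rows from the question:
# (customerid, perfiosid, filter_, sum_, count_, max_)
ROWS = [
    ("a", "b", "c", 10, 100, 1000),
    ("a", "b", "c1", 9, 900, 9000),
    ("a", "b", "c2", 80, 800, 8000),
    ("x", "b", "c", 10, 100, 1000),
    ("x", "b", "c1", 9, 900, 9000),
    ("x", "b", "c2", 80, 800, 8000),
]


def build_pivot_sql(filters, stats):
    """Build one conditional-aggregation SELECT (hypothetical helper).

    stats maps a source column to the aggregate applied to it.
    Filter values are interpolated as-is, so they must be trusted.
    """
    cols = [
        f"{agg}(CASE WHEN filter_ = '{f}' THEN {col} END) AS {col}{f}"
        for col, agg in stats.items()
        for f in filters
    ]
    return (
        "SELECT customerid,\n  "
        + ",\n  ".join(cols)
        + "\nFROM metrics\nGROUP BY customerid\nORDER BY customerid"
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metrics (customerid TEXT, perfiosid TEXT, filter_ TEXT,"
    " sum_ INT, count_ INT, max_ INT)"
)
conn.executemany("INSERT INTO metrics VALUES (?, ?, ?, ?, ?, ?)", ROWS)

sql = build_pivot_sql(
    ["c", "c1", "c2"], {"sum_": "SUM", "count_": "SUM", "max_": "MAX"}
)
result = conn.execute(sql).fetchall()

# Size of the generated statement text, to compare against a size limit.
statement_bytes = len(sql.encode("utf-8"))

print(result)
print(statement_bytes)
```

Measuring statement_bytes before sending the statement shows whether a given batch of filters fits in one statement or needs to be split across several temporary tables.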

You cannot select 54K sets of three columns in one query, because:
the 50,000th set (if precomputed into tables as Lukasz suggests) looks like
s.s_50000 as sum_50000,
c.c_50000 as count_50000,
m.m_50000 as max_50000,
which is about 75 bytes, and 54K * 75 = 4,050,000 bytes. Even listing just 54K columns (you have 18K sets of 3 columns) would take about 1.35 MB, which is still too large.
That means that once you have built your temp tables, as suggested by Lukasz, you would have to use:
select s.customer_id, s.*, c.*, m.*
from sums as s
join counts as c on s.customer_id = c.customer_id
join maxs as m on m.customer_id = c.customer_id
but building each of those temp tables takes 18K columns like
SUM(IFF(FILTER_='c18000',SUM_,null)) AS SUM_18000
which is 50 bytes per line, so 18K of those lines take about 900 KB, which might just work.
But you may then run into problems like this person, who started having problems at 8K columns:
https://community.snowflake.com/s/question/0D50Z00007CZcqmSAD/what-is-limit-on-number-of-columns-how-to-do-a-sparse-table
All of which is to say that this seems of very low value: what system is going to make sense of 50K+ columns of data, yet cannot handle processing many rows? It feels like a case of "with Tool A we know how to do Z and not Y, so Tool B must produce answers in Z format", instead of using the natural concepts of Y.
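The size estimates in this answer can be checked with a few lines of arithmetic. This is a rough sketch in Python: the 1,000,000-byte limit is an assumed, conservative reading of the "1 MB" figure from the question, and the helper estimate is hypothetical:

```python
# Worst-case lines, copied from the answer above.
join_select_lines = [
    "s.s_50000 as sum_50000,",
    "c.c_50000 as count_50000,",
    "m.m_50000 as max_50000,",
]
build_line = "SUM(IFF(FILTER_='c18000',SUM_,null)) AS SUM_18000"

# Assumed statement size limit (conservative reading of "1 MB").
LIMIT_BYTES = 1_000_000


def estimate(lines, repeats):
    """Bytes needed to repeat the lines, counting one newline per line."""
    return repeats * sum(len(line) + 1 for line in lines)


# Explicitly listing 18K sets of 3 columns in the final join.
explicit_select = estimate(join_select_lines, 18_000)

# Building one temp table with 18K conditional aggregates.
temp_table_build = estimate([build_line], 18_000)

print(explicit_select, explicit_select > LIMIT_BYTES)
print(temp_table_build, temp_table_build < LIMIT_BYTES)
```

Under these assumptions the explicit column list exceeds the limit while a single temp-table build stays below it, which is why the final join uses s.*, c.*, m.* instead of naming every column.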

Related

Logical Duplicates in mysql

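My table looks like this:
city1  city2  distance  flag
A      B      200       Y
C      D      300       N
B      A      200       N
My requirement is to check whether A & B (or B & A) exists with flag = Y.
Any help would be much appreciated.
Simple concatenation with a conditional check works, as shown below (SQL Server syntax; in MySQL, use CONCAT(city1, city2) instead of + and a regular or TEMPORARY table instead of #TempTable):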
CREATE TABLE #TempTable(
city1 VARCHAR(MAX),
city2 VARCHAR(MAX), distance int, flag VARCHAR(1))
INSERT INTO #TempTable (city1, city2, distance,flag)
VALUES ('A', 'B', 200, 'Y'),
('C', 'D', 300, 'N'),
('B', 'A', 200, 'N'),
('B', 'A', 2100, 'Y')
SELECT * FROM #TempTable
WHERE (city1+City2='AB' AND Flag='Y') OR (city1+City2='BA' AND Flag='Y')
You can use SELF JOIN to do this, e.g.:
SELECT t1.*
FROM `table` t1 JOIN `table` t2
ON t1.city1 = t2.city2 AND t1.city2 = t2.city1 AND t1.flag = t2.flag
WHERE t1.flag = 'Y';
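A third variant compares the two columns separately in both orders, which avoids the ambiguity of concatenated strings (for example 'AB' + 'C' versus 'A' + 'BC'). This is a minimal sketch, assuming Python with the standard-library sqlite3 module in place of MySQL; the table name routes and the helper pair_has_flag are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE routes (city1 TEXT, city2 TEXT, distance INT, flag TEXT)"
)
# Rows from the question.
conn.executemany(
    "INSERT INTO routes VALUES (?, ?, ?, ?)",
    [("A", "B", 200, "Y"), ("C", "D", 300, "N"), ("B", "A", 200, "N")],
)


def pair_has_flag(conn, a, b, flag="Y"):
    """Return True if (a, b) or (b, a) exists with the given flag."""
    row = conn.execute(
        """
        SELECT EXISTS (
            SELECT 1
            FROM routes
            WHERE flag = ?
              AND ((city1 = ? AND city2 = ?) OR (city1 = ? AND city2 = ?))
        )
        """,
        (flag, a, b, b, a),
    ).fetchone()
    return bool(row[0])


print(pair_has_flag(conn, "A", "B"))
print(pair_has_flag(conn, "B", "A"))
print(pair_has_flag(conn, "C", "D"))
```

The same WHERE clause should carry over to MySQL unchanged, since it only uses equality comparisons and EXISTS.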

MYSQL Calculate average of a specific occurrence in a column

I need to calculate the average rate of occurrence of a given value in a column. I made a simple example, but my real database needs around 2 inner joins to reduce the data to 100k records, and I need to perform the following select distinct statement for 10 columns.
My current design forces an inner join for each column. Another constraint is that I need to run it over at least 50-100 rows for each name in this example.
I need an efficient way to calculate these values that keeps the query fast without using too many resources.
http://sqlfiddle.com/#!9/c2378/3
My expected result is:
Name | R Avg dir | L Avg dir 1 | L Avg dir 2 | L Avg dir 3
A    | 0         | .5          | .25         | .25
Create table query:
CREATE TABLE demo
(`id` int, `name` varchar(10),`hand` varchar(1), `dir` int)
;
INSERT INTO demo
(`id`, `name`, `hand`, `dir`)
VALUES
(1, 'A', 'L', 1),
(2, 'A', 'L', 1),
(3, 'A', 'L', 2),
(4, 'A', 'L', 3),
(5, 'A', 'R', 3),
(6, 'A', 'R', 3)
;
Example Query:
SELECT distinct name,
COALESCE(( (Select count(id) as 'cd' from demo where hand = 'L' AND dir = 1) /(Select count(id) as 'fd' from demo where hand = 'L')),0) as 'L AVG dir'
FROM
demo
where hand = 'L' AND dir = 1 AND name = 'A'
One option is to use conditional aggregation:
SELECT name,
count(case when hand = 'L' and dir = 1 then 1 end) /
count(case when hand = 'L' then 1 end) L1Avg,
count(case when hand = 'L' and dir = 2 then 1 end) /
count(case when hand = 'L' then 1 end) L2Avg,
count(case when hand = 'L' and dir = 3 then 1 end) /
count(case when hand = 'L' then 1 end) L3Avg,
count(case when hand = 'R' and dir = 3 then 1 end) /
count(case when hand = 'R' then 1 end) RAvg
FROM demo
WHERE name = 'A'
GROUP BY name
Please note, I wasn't 100% sure why you wanted your RAvg to be 0 -- I assumed you meant 100%. If not, you can adjust the above accordingly.
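A slightly shorter variant of the same conditional aggregation takes the AVG of a 0/1 comparison, so each ratio needs only one expression. This is a sketch under the assumption that Python's standard-library sqlite3 module stands in for MySQL (both evaluate dir = 1 to 1 or 0); the column aliases are my own:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INT, name TEXT, hand TEXT, dir INT)")
# Rows from the question.
conn.executemany(
    "INSERT INTO demo VALUES (?, ?, ?, ?)",
    [
        (1, "A", "L", 1),
        (2, "A", "L", 1),
        (3, "A", "L", 2),
        (4, "A", "L", 3),
        (5, "A", "R", 3),
        (6, "A", "R", 3),
    ],
)

# AVG skips NULLs, so rows of the other hand are ignored, and the
# comparison contributes 1 for a match and 0 otherwise.
QUERY = """
SELECT name,
       AVG(CASE WHEN hand = 'L' THEN dir = 1 END) AS l1avg,
       AVG(CASE WHEN hand = 'L' THEN dir = 2 END) AS l2avg,
       AVG(CASE WHEN hand = 'L' THEN dir = 3 END) AS l3avg,
       AVG(CASE WHEN hand = 'R' THEN dir = 3 END) AS r3avg
FROM demo
WHERE name = 'A'
GROUP BY name
"""

row = conn.execute(QUERY).fetchone()
print(row)
```

If a name has no rows for a given hand, AVG returns NULL instead of dividing by zero, which can then be wrapped in COALESCE as in the original query.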

SQL query - credit , debit , balance

DISCLAIMER : I Know this has been asked numerous times, but all I want is an alternative.
The table is as below :
create table
Account
(Name varchar(20),
TType varchar(5),
Amount int);
insert into Account Values
('abc' ,'c', 500),
('abc', 'c', 700),
('abc', 'd', 100),
('abc', 'd', 200),
('ab' ,'c', 300),
('ab', 'c', 700),
('ab', 'd', 200),
('ab', 'd', 200);
Expected result is simple:
Name Balance
------ -----------
ab 600
abc 900
The query that worked is:
select Name, sum(case TType when 'c' then Amount
when 'd' then Amount * -1 end) as balance
from Account a1
group by Name;
What I want to know is: is there any query without the 'case' expression (such as a subquery or self join) that gives the same result?
Sure. You can use a second query with a where clause and a union all, then sum the two partial results per name:
select name
, sum(balance) balance
from ( select name
, sum(Amount) balance
from Account a1
where TType = 'c'
group
by Name
union
all
select name
, sum(Amount * -1) balance
from Account a1
where TType = 'd'
group
by Name
) t
group
by name
Or this, using a join with an inline view:
select name
, sum(Amount * o.mult) balance
from Account a1
join ( select 'c' cd
, 1 mult
from dual
union all
select 'd'
, -1
from dual
) o
on o.cd = a1.TType
group
by Name
To be honest, I would suggest to use case...
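The multiplier-join variant can be tried end to end on the question's data. This is a minimal sketch, assuming Python's standard-library sqlite3 module in place of the original database; SQLite has no dual table, so the inline view uses bare SELECTs instead:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Account (Name TEXT, TType TEXT, Amount INT)")
# Rows from the question.
conn.executemany(
    "INSERT INTO Account VALUES (?, ?, ?)",
    [
        ("abc", "c", 500),
        ("abc", "c", 700),
        ("abc", "d", 100),
        ("abc", "d", 200),
        ("ab", "c", 300),
        ("ab", "c", 700),
        ("ab", "d", 200),
        ("ab", "d", 200),
    ],
)

# The inline view maps each transaction type to a sign: +1 for
# credits ('c') and -1 for debits ('d'). No CASE expression is needed.
QUERY = """
SELECT a.Name, SUM(a.Amount * o.mult) AS balance
FROM Account AS a
JOIN (SELECT 'c' AS cd, 1 AS mult
      UNION ALL
      SELECT 'd', -1) AS o
  ON o.cd = a.TType
GROUP BY a.Name
ORDER BY a.Name
"""

balances = dict(conn.execute(QUERY).fetchall())
print(balances)
```

Rows whose TType is neither 'c' nor 'd' are dropped by the inner join, which differs from the CASE version, where they would contribute NULL to the sum.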
Use the ASCII code of the character and go from there: it is 100 for 'd' and 99 for 'c'. Untested example:
select Name, sum((ASCII(TType) - 100) * Amount * (-1)) + sum((ASCII(TType) - 99) * Amount * (-1)) as balance from Account a1 group by Name;
I would not recommend using this method but it is a way of achieving what you want.
select t.Name, sum(t.cr) - sum(t.dr) as balance from (select Name, case TType when 'c' then sum(Amount) else 0 end as cr, case TType when 'd' then sum(Amount) else 0 end as dr from Account group by Name, TType) t group by t.Name;
This will surely help you!!
The following worked for me on Microsoft SQL server. It has the Brought Forward balance as well
WITH tempDebitCredit AS (
Select 0 As Details_ID, null As Creation_Date, null As Reference_ID,
'Brought Forward' As Transaction_Kind, null As Amount_Debit, null As Amount_Credit,
isNull(Sum(Amount_Debit - Amount_Credit), 0) 'diff'
From _YourTable_Name
where Account_ID = @Account_ID
And Creation_Date < @Query_Start_Date
Union All
SELECT a.Details_ID, a.Creation_Date, a.Reference_ID, a.Transaction_Kind,
a.Amount_Debit, a.Amount_Credit, a.Amount_Debit - a.Amount_Credit 'diff'
FROM _YourTable_Name a
where Account_ID = @Account_ID
And Creation_Date >= @Query_Start_Date And Creation_Date <= @Query_End_Date
)
SELECT a.Details_ID, a.Creation_Date, a.Reference_ID, a.Transaction_Kind,
a.Amount_Debit, a.Amount_Credit, SUM(b.diff) 'Balance'
FROM tempDebitCredit a, tempDebitCredit b
WHERE b.Details_ID <= a.Details_ID
GROUP BY a.Details_ID, a.Creation_Date, a.Reference_ID, a.Transaction_Kind,
a.Amount_Debit, a.Amount_Credit
Order By a.Details_ID Desc
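The logic of that statement (one "Brought Forward" row holding everything before the start date, followed by a running balance over the rows in range) can be sketched outside SQL as well. This is a rough illustration in Python; the ledger rows, the integer day field, and the helper statement are all hypothetical, and rows are returned in ascending order rather than descending:

```python
from itertools import accumulate

# Hypothetical ledger rows: (details_id, day, amount_debit, amount_credit)
LEDGER = [
    (1, 1, 100, 0),
    (2, 3, 0, 40),
    (3, 5, 25, 0),
    (4, 8, 0, 10),
]


def statement(ledger, start_day, end_day):
    """Return (details_id, kind, balance) rows for the given day range."""
    # Everything before the range collapses into one opening balance.
    brought_forward = sum(
        debit - credit for _, day, debit, credit in ledger if day < start_day
    )
    in_range = [row for row in ledger if start_day <= row[1] <= end_day]

    # Running total of debit minus credit, seeded with the opening balance.
    diffs = [brought_forward] + [d - c for _, _, d, c in in_range]
    balances = list(accumulate(diffs))

    rows = [(0, "Brought Forward", balances[0])]
    rows += [(r[0], "Entry", bal) for r, bal in zip(in_range, balances[1:])]
    return rows


rows = statement(LEDGER, start_day=4, end_day=8)
print(rows)
```

The SQL version gets the same running total from the self join on b.Details_ID <= a.Details_ID, which is quadratic in the number of rows; a single pass like the one above is linear.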

Count occurrences that differ within a column

I want to count how many times the data in columns Somedata_A and Somedata_B has changed from the previous row within its column. I've tried using DISTINCT and it works to some degree, but {1,2,3,2,1,1} shows 3 when I want it to show 4, because the sequence contains 5 runs of values (1, 2, 3, 2, 1) and therefore 4 changes.
Example:
A,B,C,D,E,F
{1,2,3,2,1,1}
Comparing A to B gives a difference, comparing B to C gives a difference, and so on; comparing E to F gives no difference. All in all, that is 4 differences within a set of 6 values.
I have gotten DISTINCT to work, but it does not really do the trick for me. In addition, I'm not really interested in the whole range, only in, say, the last 2 days/entries per Title.
Second, I'm concerned about performance. I tried the query below on a real data set and it was interrupted, probably due to a timeout.
SQL Fiddle
MySQL 5.5.32 Schema Setup:
CREATE TABLE testdata(
Title varchar(10),
Date varchar(10),
Somedata_A int(5),
Somedata_B int(5));
INSERT INTO testdata (Title, Date, Somedata_A, Somedata_B) VALUES
("Alpha", '123', 1, 2),
("Alpha", '234', 2, 2),
("Alpha", '345', 1, 2),
("Alpha", '349', 1, 2),
("Alpha", '456', 1, 2),
("Omega", '123', 1, 1),
("Omega", '234', 2, 2),
("Omega", '345', 3, 3),
("Omega", '349', 4, 3),
("Omega", '456', 5, 4),
("Delta", '123', 1, 1),
("Delta", '234', 2, 2),
("Delta", '345', 1, 3),
("Delta", '349', 2, 3),
("Delta", '456', 1, 4);
Query 1:
SELECT t.Title, (SELECT COUNT(DISTINCT Somedata_A) FROM testdata AS tt WHERE t.Title = tt.Title) AS A,
(SELECT COUNT(DISTINCT Somedata_B) FROM testdata AS tt WHERE t.Title = tt.Title) AS B
FROM testdata AS t
GROUP BY t.Title
Results:
| TITLE | A | B |
|-------|---|---|
| Alpha | 2 | 1 |
| Delta | 2 | 4 |
| Omega | 5 | 4 |
Something like this may work: it uses a variable for row number, joins on an offset of 1 and then counts differences for A and B.
http://sqlfiddle.com/#!2/3bbc8/9/2
set @i = 0;
set @j = 0;
Select
A.Title aTitle,
sum(Case when A.SomeData_A <> B.SomeData_A then 1 else 0 end) AVar,
sum(Case when A.SomeData_B <> B.SomeData_B then 1 else 0 end) BVar
from
(SELECT Title, @i:=@i+1 as ROWID, SomeData_A, SomeData_B
FROM testdata
ORDER BY Title, date desc) as A
INNER JOIN
(SELECT Title, @j:=@j+1 as ROWID, SomeData_A, SomeData_B
FROM testdata
ORDER BY Title, date desc) as B
ON A.RowID= B.RowID + 1
AND A.Title=B.Title
Group by A.Title
This works. (FYI: the results in your question do not match your data. For instance, for Alpha, column A: it never changes from 1, so the answer should be 0.)
Hopefully you can adapt this statement to your actual data model:
SELECT t1.title, SUM(t1.Somedata_A<>t2.Somedata_a) as SomeData_A
,SUM(t1.Somedata_b<>t2.Somedata_b) as SomeData_B
FROM testdata AS t1
JOIN testdata AS t2
ON t1.title = t2.title
AND t2.date = DATE_ADD(t1.date, INTERVAL 1 DAY)
GROUP BY t1.title
ORDER BY t1.title;
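The counting rule from the question (compare each value with its predecessor and count the differences) can be checked independently of any SQL dialect. This is a small sketch in Python using the question's test data; the helper names count_changes and changes_per_title are hypothetical, and rows are ordered by the Date string, which works here only because all sample dates have the same length:

```python
from collections import defaultdict

# Rows from the question: (Title, Date, Somedata_A, Somedata_B)
TESTDATA = [
    ("Alpha", "123", 1, 2), ("Alpha", "234", 2, 2), ("Alpha", "345", 1, 2),
    ("Alpha", "349", 1, 2), ("Alpha", "456", 1, 2),
    ("Omega", "123", 1, 1), ("Omega", "234", 2, 2), ("Omega", "345", 3, 3),
    ("Omega", "349", 4, 3), ("Omega", "456", 5, 4),
    ("Delta", "123", 1, 1), ("Delta", "234", 2, 2), ("Delta", "345", 1, 3),
    ("Delta", "349", 2, 3), ("Delta", "456", 1, 4),
]


def count_changes(values):
    """Number of positions where a value differs from its predecessor."""
    return sum(1 for prev, cur in zip(values, values[1:]) if prev != cur)


def changes_per_title(rows, last_n=None):
    """Map each title to (changes in A, changes in B).

    If last_n is given, only the last_n entries per title are compared.
    """
    by_title = defaultdict(list)
    for title, date, a, b in sorted(rows, key=lambda r: (r[0], r[1])):
        by_title[title].append((a, b))
    result = {}
    for title, pairs in by_title.items():
        if last_n is not None:
            pairs = pairs[-last_n:]
        result[title] = (
            count_changes([a for a, _ in pairs]),
            count_changes([b for _, b in pairs]),
        )
    return result


print(count_changes([1, 2, 3, 2, 1, 1]))
print(changes_per_title(TESTDATA))
print(changes_per_title(TESTDATA, last_n=2))
```

A reference implementation like this is mainly useful for validating whichever SQL variant is adopted against a small sample before running it on the full table.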

Group by, with rank and sum - not getting correct output

I'm trying to sum a column, rank the result, and group by month. My code is:
select dbo.UpCase( REPLACE( p.Agent_name,'.',' '))as Agent_name, SUM(convert ( float ,
p.Amount))as amount,
RANK() over( order by SUM(convert ( float ,Amount )) desc ) as arank
from dbo.T_Client_Pc_Reg p
group by p.Agent_name ,p.Sale_status ,MONTH(Reg_date)
having [p].Sale_status='Activated'
Currently I'm getting the all-time total of that column, not the monthly total:
Name amount rank
a 100 1
b 80 2
c 50 3
For a, the amount 100 is the total to date, but I want the total for the current month only, not including previous months.
Maybe you just need to add a WHERE clause? Here is a minor re-write that I think works generally better. Some setup in tempdb:
USE tempdb;
GO
CREATE TABLE dbo.T_Client_Pc_Reg
(
Agent_name VARCHAR(32),
Amount INT,
Sale_Status VARCHAR(32),
Reg_date DATETIME
);
INSERT dbo.T_Client_Pc_Reg
SELECT 'a', 50, 'Activated', GETDATE()
UNION ALL SELECT 'a', 50, 'Activated', GETDATE()
UNION ALL SELECT 'b', 20, 'Activated', GETDATE()
UNION ALL SELECT 'b', 20, 'Activated', GETDATE()
UNION ALL SELECT 'b', 20, 'Activated', GETDATE()
UNION ALL SELECT 'b', 20, 'Activated', GETDATE()
UNION ALL SELECT 'b', 20, 'NotActivated', GETDATE()
UNION ALL SELECT 'c', 25, 'Activated', GETDATE()
UNION ALL SELECT 'c', 25, 'Activated', GETDATE()
UNION ALL SELECT 'c', 25, 'Activated', GETDATE()-40;
Then the query:
SELECT
Agent_name = UPPER(REPLACE(Agent_name, '.', '')),
Amount = SUM(CONVERT(FLOAT, Amount)),
arank = RANK() OVER (ORDER BY SUM(CONVERT(FLOAT, Amount)) DESC)
FROM dbo.T_Client_Pc_Reg
WHERE Reg_date >= DATEADD(MONTH, DATEDIFF(MONTH, 0, CURRENT_TIMESTAMP), 0)
AND Reg_date < DATEADD(MONTH, DATEDIFF(MONTH, 0, CURRENT_TIMESTAMP) + 1, 0)
AND Sale_status = 'Activated'
GROUP BY UPPER(REPLACE(Agent_name, '.', ''))
ORDER BY arank;
Now cleanup:
USE tempdb;
GO
DROP TABLE dbo.T_Client_Pc_Reg;
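The key part of that query is the half-open date range [first day of this month, first day of next month). Here is a sketch of the same filter, sum, and rank steps in Python, using rows shaped like the setup data above. The fixed reference date and the helpers month_bounds and rank_agents are assumptions for illustration:

```python
from collections import defaultdict
from datetime import date, timedelta


def month_bounds(today):
    """Return (first day of today's month, first day of the next month)."""
    start = today.replace(day=1)
    if start.month == 12:
        next_start = start.replace(year=start.year + 1, month=1)
    else:
        next_start = start.replace(month=start.month + 1)
    return start, next_start


def rank_agents(rows, today):
    """Sum activated amounts for the current month and rank them."""
    start, next_start = month_bounds(today)
    totals = defaultdict(float)
    for agent, amount, status, reg_date in rows:
        # Half-open range: includes start, excludes next_start.
        if status == "Activated" and start <= reg_date < next_start:
            totals[agent.replace(".", "").upper()] += amount
    ordered = sorted(totals.items(), key=lambda kv: -kv[1])
    # RANK() semantics: 1 + number of strictly larger totals.
    return [
        (name, total, 1 + sum(t > total for t in totals.values()))
        for name, total in ordered
    ]


TODAY = date(2024, 3, 15)  # assumed reference date
OLD = TODAY - timedelta(days=40)  # falls in an earlier month
ROWS = (
    [("a", 50, "Activated", TODAY)] * 2
    + [("b", 20, "Activated", TODAY)] * 4
    + [("b", 20, "NotActivated", TODAY)]
    + [("c", 25, "Activated", TODAY)] * 2
    + [("c", 25, "Activated", OLD)]
)

ranking = rank_agents(ROWS, TODAY)
print(ranking)
```

Comparing against the two boundary dates, rather than applying MONTH() to the column, also keeps the condition usable with an index on Reg_date in the SQL version.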