Presto equivalent for Redshift's PERCENTILE_DISC - mysql

Given a query below in Redshift:
select
distinct cast(joinstart_ev_timestamp as date) as session_date,
PERCENTILE_DISC(0.02) WITHIN GROUP (ORDER BY join_time) over(partition by
trunc(joinstart_ev_timestamp))/1000 as mini,
median(join_time) over(partition by trunc(joinstart_ev_timestamp))/1000 as jt,
product_name as product,
endpoint as endpoint
from qe_datawarehouse.join_session_fact
where
cast(joinstart_ev_timestamp as date) between date '2018-01-18' and date '2018-01-30'
and lower(product_name) LIKE 'gotoTest%'
and join_time > 0 and join_time <= 600000 and join_time is not null
and audio_connect_time >= 0
and (entrypoint_access_time >= 0 or entrypoint_access_time is null)
and (panel_connect_time >= 0 or panel_connect_time is null) and version = 'V2'
I need to convert above Query to corresponding Presto syntax.
Corresponding Presto query I wrote is:
select
distinct cast(joinstart_ev_timestamp as date) as session_date,
PERCENTILE_DISC( WITHIN GROUP (ORDER BY cast(join_time as double))
over(partition by cast(joinstart_ev_timestamp as date) )/1000 as mini,
approx_percentile(cast(join_time as double),0.50) over (partition by
cast(joinstart_ev_timestamp as date)) /1000 as jt,
product_name as product,
endpoint as endpoint
from datawarehouse.join_session_fact
where
cast(joinstart_ev_timestamp as date) between date '2018-01-18' and date '2018-01-30'
and lower(product_name) LIKE 'gotoTest%'
and join_time > 0 and join_time <= 600000 and join_time is not null
and audio_connect_time >= 0
and (entrypoint_access_time >= 0 or entrypoint_access_time is null)
and (panel_connect_time >= 0 or panel_connect_time is null) and version = 'V2'
Here, everything is working fine but it is showing error in the line:
PERCENTILE_DISC( WITHIN GROUP (ORDER BY cast(join_time as double))
over(partition by cast(joinstart_ev_timestamp as date) )/1000 as mini,
What will be its corresponding Presto Syntax?

If Presto supported nested window functions then you could use NTH_VALUE along with p*COUNT(*) OVER (PARTITION BY ...) to find the offset corresponding to the "p'th" percentile in the window. Since Presto doesn't support this, you need to join to a subquery that calculates the number of records in the window instead:
SELECT
my_table.window_column,
/* Replace :p with the desired percentile (in your case, 0.02) */
NTH_VALUE(:p*subquery.records_in_window, my_table.ordered_column)
OVER (PARTITION BY my_table.window_column ORDER BY my_table.ordered_column BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM my_table
JOIN (
SELECT
window_column,
COUNT(*) AS records_in_window
FROM my_table
GROUP BY window_column
) subquery ON subquery.window_column = my_table.window_column
The above is conceptually close but fails because :p*subquery.records_in_window is a float and the offset needs to be an integer. You have a few options for how to deal with this this. For example, if you're finding the median then simply rounding to the nearest integer works. If you're finding the 2nd percentile, rounding won't work because it will often give you 0 and the offset starts at 1. In that case, rounding the ceiling to the nearest integer might be better.

I was doing some research on median in presto, and found a solution that worked for me:
For example, I had a join table, A_join_B, that has columns A_id and B_id.
I wanted to find median of number of A related to a single B
SELECT APPPROX_PERCENTILE(count, 0.5)
FROM
(
SELECT COUNT(*) AS count, narrative_id
FROM A_join_B
GROUP BY B_id
) as counts;

Related

Calculating consecutive occurences in MySQL

I have a quick question in relation to windowing in MySQL
SELECT
Client,
User,
Date,
Flag,
lag(Date) over (partition by Client,User order by Date asc) as last_date,
lag(Flag) over (partition by Client,User order by Date asc) as last_flag,
case when Flag = 1 and last_flag = 1 then 1 else 0 end as consecutive
FROM db.tbl
This query returns something like the below. I am trying to work out the number of consecutive times that the Flag column was 1 for each user most recently, if they had 11110000111 then we should take the final three occurences of 1 to determine that they had a consecutive flag of 3 times.
I need to extract the start and end date for the consecutive flag.
How would I go about doing this, can anyone help me :)
If we use the example of 11110000111 then we should extract only 111 and therefore the 3 most recent dates for that customer. So in the below, we would need to take 10.01.2023 as the first date and 24.01.2023 as the last date. The consecutive count should be 3
Output:
Use aggregation and string functions:
WITH cte AS (
SELECT Client, User,
GROUP_CONCAT(CASE WHEN Flag THEN Date END ORDER BY Date) AS dates,
CHAR_LENGTH(SUBSTRING_INDEX(GROUP_CONCAT(Flag ORDER BY Date SEPARATOR ''), '0', '-1')) AS consecutive
FROM tablename
GROUP BY Client, User
)
SELECT Client, User,
NULLIF(SUBSTRING_INDEX(SUBSTRING_INDEX(dates, ',', -consecutive), ',', 1), '') AS first_date,
CASE WHEN consecutive > 0 THEN SUBSTRING_INDEX(dates, ',', -1) END AS last_date,
consecutive
FROM cte;
Another solution with window functions and conditional aggregation:
WITH
cte1 AS (SELECT *, SUM(NOT Flag) OVER (PARTITION BY Client, User ORDER BY Date) AS grp FROM tablename),
cte2 AS (SELECT *, MAX(grp) OVER (PARTITION BY Client, User) AS max_grp FROM cte1)
SELECT Client, User,
MIN(CASE WHEN Flag THEN Date END) AS first_date,
MAX(CASE WHEN Flag THEN Date END) AS last_date,
SUM(Flag) AS consecutive
FROM cte2
WHERE grp = max_grp
GROUP BY Client, User;
See the demo.
Made an attempt to get the result with more simpler queries and here is my approach taking advantage of lastDate and lastFlag column too.
Run here
WITH eTT
AS
( SELECT Client, User, NULLIF(MAX(Date),
(SELECT MAX(Date) FROM tt t2 WHERE t1.Client=t2.Client AND t1.User=t2.User)) as endDate
FROM tt t1 WHERE LastFlag=0 OR LastFlag IS NULL GROUP BY Client, User
)
SELECT Client, User,
(CASE WHEN MAX(endDate) IS NULL THEN NULL ELSE MIN(Date) END) as first_date,
(CASE WHEN MAX(endDate) IS NULL THEN NULL ELSE MAX(Date) END) as last_date,
(CASE WHEN MAX(endDate) IS NULL THEN NULL ELSE COUNT(endDate) END) as consecutive
FROM tt LEFT JOIN eTT USING (Client, User)
WHERE Date >= endDate OR endDate IS null GROUP BY Client, User;
EDIT
The original table doesn't have LastDate and LastFlag columns and were created using OP's initial query.
Since the method used is not apparantly supported but I get an impression that OP somehow manages to do that on their side.
Hence another cte called tt can be added before eTT containing that query.

Mysql Sum over partition by

Hi I am doing MySQL and using 'Sum over (partition by )'
I want to see the values are adding up by following lines like below
but my result is like just
I'm using the following query:
select dea.location, sum(cast(vac.new_vaccinations as signed)) over (partition by dea.location order by dea.location)
From pr.CovidDeaths_csv dea
join pr.CovidVaccinations_csv vac
on dea.location = vac.location
and dea.date = vac.date
where dea.continent is not null
order by 2;
Does anyone know about this problem?
You're missing the frame specification for window functions in MySQL. It allows you to apply a cumulative sum instead of a static sum:
select dea.location,
sum(cast(vac.new_vaccinations as signed))
over(partition by dea.location
order by dea.location ROWS UNBOUNDED PRECEDING)
From pr.CovidDeaths_csv dea
join pr.CovidVaccinations_csv vac
on dea.location = vac.location
and dea.date = vac.date
where dea.continent is not null
order by 2;
As you've not shared your data from all your tables, I cannot replicate your case, but you can see an analogous pattern on sample data here.

Select latest record with date and time column with group by clause in mysql

I have this table from which I have to select the latest row on the basis of date and time column for each checkpost
I have tried the following queries but not returning the latest data for each checkpost.
SELECT checkpost_id,current_rate,date,time FROM revisionrates
WHERE date IN (SELECT max(date) FROM revisionrates GROUP BY checkpost_id)
The expected output is
You can use window functions:
select rr.*
from (select rr.*,
row_number() over (partition by checkpost_id order by date desc, time desc) as seqnum
from revisionrates rr
) rr
where seqnum = 1;
This requires MySQL 8.0. In earlier versions of MySQL, this is a bit trickier, but one method uses tuples
select rr.*
from rr
where (date, time) in (select rr2.date, rr2.time
from revisionrates rr2
where rr2.checkpoint_id = rr.checkpoint_id
order by rr2.date desc, rr2.time desc
limit 1
);

SQL query that finds a negative change between two rows with the same name field

I have a single table with rows like this: (Date, Score, Name)
The Date field has two possible dates, and it's possible that a Name value will appear under only one date (if that name was recently added or removed).
I'm looking to get a table with rows like this: (Delta, Name), where delta is the score change for each name between the earlier and later dates. In addition, only a negative change interests me, so if Delta>=0, it shouldn't appear in the output table at all.
My main challenge for me is calculating the Delta field.
As stated in the title, it should be an SQL query.
Thanks in advance for any help!
I assumed that each name can have it's own start/end dates. It can be simplified significantly if there are only two possible dates for the entire table.
I tried this out in SQL Fiddle here
SELECT (score_end - score_start) delta, name_start
FROM
( SELECT date date_start, score score_start, name name_start
FROM t t
WHERE NOT EXISTS
( SELECT 1
FROM t x
WHERE x.date < t.date
AND x.name = t.name
)
) AS start_date_t
JOIN
( SELECT date date_end, score score_end, name name_end
FROM t t
WHERE NOT EXISTS
( SELECT 1
FROM t x
WHERE x.date > t.date
AND x.name = t.name
)
) end_date_t ON start_date_t.name_start = end_date_t.name_end
WHERE score_end-score_start < 0
lets say you have a table with date_value, sum_value
Then it should be something like that:
select t.date_value,sum_value,
sum_value - COALESCE((
select top 1 sum_value
from tmp_num
where date_value > t.date_value
order by date_value
),0) as sum_change
from tmp_num as t
order by t.date_value
The following uses a "trick" in MySQL that I don't really like using, because it turns the score into a string and then back into a number. But, it is an easy way to get what you want:
select t.name, (lastscore - firstscore) as diff
from (select t.name,
substring_index(group_concat(score order by date asc), ',', 1) as firstscore,
substring_index(group_concat(score order by date desc), ',', 1) as lastscore
from table t
group by t.name
) t
where lastscore - firstscore < 0;
If MySQL supported window functions, such tricks wouldn't be necessary.

sql calculate change and percent by year

I have an data set that simulates the rate of return for a trading account. There is an entry for each day showing the balance and the open equity. I want to calculate the yearly, or quarterly, or monthly change and percent gain or loss. I have this working for daily data, but for some reason I can't seem to get it to work for yearly data.
The code for daily data follows:
SELECT b.`Date`, b.Open_Equity, delta,
concat(round(delta_p*100,4),'%') as delta_p
FROM (SELECT *,
(Open_Equity - #pequity) as delta,
(Open_Equity - #pequity)/#pequity as delta_p,
(#pequity:= Open_Equity)
FROM tim_account_history p
CROSS JOIN
(SELECT #pequity:= NULL
FROM tim_account_history
ORDER by `Date` LIMIT 1) as a
ORDER BY `Date`) as b
ORDER by `Date` ASC
Grouping by YEAR(Date) doesn't seem to make the desired difference. I have tried everything I can think of, but it still seems to return daily rate of change even if you group by month or year, etc. I think I'm not using windowing correctly, but I can't seem to figure it out. If anyone knows of a good book about this sort of query I'd appreciate that also.
Thanks.sqlfiddle example
Using what Lolo contributed, I have added some code so the data comes from the last day of the year, instead of the first. I also just need the Open_Equity, not the sum.
I'm still not certain I understand why this works, but it does give me what I was looking for. Using another select statement as a from seems to be the key here; I don't think I would have come up with this without Lolo's help. Thank you.
SELECT b.`yyyy`, b.Open_Equity,
concat('$',round(delta, 2)) as delta,
concat(round(delta_p*100,4),'%') as delta_p
FROM (SELECT *,
(Open_Equity - #pequity) as delta,
(Open_Equity - #pequity)/#pequity as delta_p,
(#pequity:= Open_Equity)
FROM (SELECT (EXTRACT(YEAR FROM `Date`)) as `yyyy`,
(SUBSTRING_INDEX(GROUP_CONCAT(CAST(`Open_Equity` AS CHAR) ORDER BY `Date` DESC), ',', 1 )) AS `Open_Equity`
FROM tim_account_history GROUP BY `yyyy` ORDER BY `yyyy` DESC) p
CROSS JOIN
(SELECT #pequity:= NULL) as a
ORDER BY `yyyy` ) as b
ORDER by `yyyy` ASC
Try this:
SELECT b.`Date`, b.Open_Equity, delta,
concat(round(delta_p*100,4),'%') as delta_p
FROM (SELECT *,
(Open_Equity - #pequity) as delta,
(Open_Equity - #pequity)/#pequity as delta_p,
(#pequity:= Open_Equity)
FROM (SELECT YEAR(`Date`) `Date`, SUM(Open_Equity) Open_Equity FROM tim_account_history GROUP BY YEAR(`Date`)) p
CROSS JOIN
(SELECT #pequity:= NULL) as a
ORDER BY `Date` ) as b
ORDER by `Date` ASC