Mysql Sum over partition by - mysql

Hi, I am working in MySQL and using SUM() OVER (PARTITION BY ...).
I want to see the values adding up row by row (a running total),
but my result is just a single static total repeated for each location.
I'm using the following query:
select dea.location,
       sum(cast(vac.new_vaccinations as signed))
         over (partition by dea.location order by dea.location)
from pr.CovidDeaths_csv dea
join pr.CovidVaccinations_csv vac
  on dea.location = vac.location
  and dea.date = vac.date
where dea.continent is not null
order by 2;
Does anyone know about this problem?

You're missing the frame specification for the window function in MySQL. Adding one lets you compute a cumulative sum instead of a static sum:
select dea.location,
       sum(cast(vac.new_vaccinations as signed))
         over (partition by dea.location
               order by dea.location
               ROWS UNBOUNDED PRECEDING)
from pr.CovidDeaths_csv dea
join pr.CovidVaccinations_csv vac
  on dea.location = vac.location
  and dea.date = vac.date
where dea.continent is not null
order by 2;
As you've not shared your data from all your tables, I cannot replicate your case, but you can see an analogous pattern on sample data here.
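As a side note: if the goal is a day-by-day running total per location, ordering the window by the date column (rather than by the partition column) is usually what produces it. A minimal sketch under that assumption, reusing the table and column names from the question:
-- Running total of new vaccinations per location, accumulated by date
select dea.location,
       dea.date,
       sum(cast(vac.new_vaccinations as signed))
         over (partition by dea.location
               order by dea.date
               rows unbounded preceding) as rolling_vaccinations
from pr.CovidDeaths_csv dea
join pr.CovidVaccinations_csv vac
  on dea.location = vac.location
  and dea.date = vac.date
where dea.continent is not null
order by dea.location, dea.date;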

Related

Presto equivalent for Redshift's PERCENTILE_DISC

Given the query below in Redshift:
select
distinct cast(joinstart_ev_timestamp as date) as session_date,
PERCENTILE_DISC(0.02) WITHIN GROUP (ORDER BY join_time) over(partition by
trunc(joinstart_ev_timestamp))/1000 as mini,
median(join_time) over(partition by trunc(joinstart_ev_timestamp))/1000 as jt,
product_name as product,
endpoint as endpoint
from qe_datawarehouse.join_session_fact
where
cast(joinstart_ev_timestamp as date) between date '2018-01-18' and date '2018-01-30'
and lower(product_name) LIKE 'gotoTest%'
and join_time > 0 and join_time <= 600000 and join_time is not null
and audio_connect_time >= 0
and (entrypoint_access_time >= 0 or entrypoint_access_time is null)
and (panel_connect_time >= 0 or panel_connect_time is null) and version = 'V2'
I need to convert the above query to the corresponding Presto syntax.
The corresponding Presto query I wrote is:
select
distinct cast(joinstart_ev_timestamp as date) as session_date,
PERCENTILE_DISC( WITHIN GROUP (ORDER BY cast(join_time as double))
over(partition by cast(joinstart_ev_timestamp as date) )/1000 as mini,
approx_percentile(cast(join_time as double),0.50) over (partition by
cast(joinstart_ev_timestamp as date)) /1000 as jt,
product_name as product,
endpoint as endpoint
from datawarehouse.join_session_fact
where
cast(joinstart_ev_timestamp as date) between date '2018-01-18' and date '2018-01-30'
and lower(product_name) LIKE 'gotoTest%'
and join_time > 0 and join_time <= 600000 and join_time is not null
and audio_connect_time >= 0
and (entrypoint_access_time >= 0 or entrypoint_access_time is null)
and (panel_connect_time >= 0 or panel_connect_time is null) and version = 'V2'
Everything else is working fine, but it shows an error on this line:
PERCENTILE_DISC( WITHIN GROUP (ORDER BY cast(join_time as double))
over(partition by cast(joinstart_ev_timestamp as date) )/1000 as mini,
What will be its corresponding Presto Syntax?
If Presto supported nested window functions then you could use NTH_VALUE along with p*COUNT(*) OVER (PARTITION BY ...) to find the offset corresponding to the "p'th" percentile in the window. Since Presto doesn't support this, you need to join to a subquery that calculates the number of records in the window instead:
SELECT
  my_table.window_column,
  /* Replace :p with the desired percentile (in your case, 0.02) */
  NTH_VALUE(my_table.ordered_column, :p * subquery.records_in_window)
    OVER (PARTITION BY my_table.window_column
          ORDER BY my_table.ordered_column
          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM my_table
JOIN (
  SELECT
    window_column,
    COUNT(*) AS records_in_window
  FROM my_table
  GROUP BY window_column
) subquery ON subquery.window_column = my_table.window_column
The above is conceptually close, but it fails because :p*subquery.records_in_window is a float and the offset needs to be an integer. You have a few options for dealing with this. For example, if you're finding the median, simply rounding to the nearest integer works. If you're finding the 2nd percentile, rounding won't work because it will often give you 0, and offsets start at 1. In that case, taking the ceiling instead is the better choice.
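For instance, assuming (as the sketch above does) that the offset argument may be an expression, the ceiling fix could look like this, with :p and the placeholder names carried over from the sketch:
SELECT
  my_table.window_column,
  NTH_VALUE(my_table.ordered_column,
            CAST(CEIL(:p * subquery.records_in_window) AS INTEGER))
    OVER (PARTITION BY my_table.window_column
          ORDER BY my_table.ordered_column
          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS pth_percentile
FROM my_table
JOIN (
  SELECT window_column, COUNT(*) AS records_in_window
  FROM my_table
  GROUP BY window_column
) subquery ON subquery.window_column = my_table.window_column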
I was doing some research on median in presto, and found a solution that worked for me:
For example, I had a join table, A_join_B, that has columns A_id and B_id.
I wanted to find the median number of A rows related to a single B:
SELECT APPROX_PERCENTILE(count, 0.5)
FROM
(
  SELECT COUNT(*) AS count, B_id
  FROM A_join_B
  GROUP BY B_id
) as counts;
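Applied to the original question, the same approx_percentile idea can replace the PERCENTILE_DISC window entirely by aggregating per day. A sketch using the question's table and column names (note it is approximate, and it drops the per-row product/endpoint columns since it groups by day):
SELECT cast(joinstart_ev_timestamp as date) AS session_date,
       approx_percentile(cast(join_time as double), 0.02) / 1000 AS mini,
       approx_percentile(cast(join_time as double), 0.50) / 1000 AS jt
FROM datawarehouse.join_session_fact
WHERE cast(joinstart_ev_timestamp as date) BETWEEN date '2018-01-18' AND date '2018-01-30'
  AND join_time > 0 AND join_time <= 600000
GROUP BY 1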

Eliminate First 14 For Each Symbol From Query

The following query pulls all rows that do not exist in a relative_strength_index table. But I also need to eliminate the first 14 rows for each symbol, based on date ascending, from the historical_data table. I have made several attempts to do this but am having real trouble with the 14 days. How could this be resolved and added to my current query?
Current Query
select *
from historical_data hd
where not exists (select rsi_symbol, rsi_date
                  from relative_strength_index
                  where hd.symbol = rsi_symbol
                    and hd.histDate = rsi_date);
What you want is the first argument of the LIMIT clause, which states which row to start from, accompanied by ORDER BY ... ASC.
select *
from historical_data hd
where not exists (select rsi_symbol, rsi_date
                  from relative_strength_index
                  where hd.symbol = rsi_symbol
                    and hd.histDate = rsi_date
                  ORDER BY rsi_date ASC
                  LIMIT 14)
Use OFFSET along with LIMIT like this; the following will return a maximum of 100,000 rows, starting at row 15:
select *
from historical_data hd
where not exists (select rsi_symbol, rsi_date from relative_strength_index where hd.symbol = rsi_symbol and hd.histDate = rsi_date)
order by hd.histDate asc
limit 100000 offset 14;
But because you're using LIMIT and OFFSET, you'll want a deterministic ORDER BY before specifying them.
UPDATE: you mentioned "for each symbol", so try this query. It ranks each symbol's rows by date ascending and then selects only the rows whose computed rank is >= 15 (i.e. it skips the first 14 per symbol):
SELECT *
FROM
(select hd.*,
        @rank := CASE WHEN @previous_symbol = hd.symbol THEN @rank + 1
                      ELSE 1
                 END as rnk,
        @previous_symbol := hd.symbol
 from historical_data hd
 cross join (select @rank := 0, @previous_symbol := NULL) vars
 where not exists (select rsi_symbol, rsi_date
                   from relative_strength_index
                   where hd.symbol = rsi_symbol
                     and hd.histDate = rsi_date)
 order by hd.symbol, hd.histDate asc
) T
WHERE T.rnk >= 15
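As an aside, on MySQL 8.0 or later the same ranking can be done with a window function instead of user variables. A sketch using the table and column names from the original query:
SELECT *
FROM (
  SELECT hd.*,
         ROW_NUMBER() OVER (PARTITION BY hd.symbol
                            ORDER BY hd.histDate ASC) AS rn
  FROM historical_data hd
  WHERE NOT EXISTS (SELECT 1
                    FROM relative_strength_index
                    WHERE hd.symbol = rsi_symbol
                      AND hd.histDate = rsi_date)
) t
WHERE t.rn >= 15;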
It's not clear (to me) what resultset you want to return, or the conditions that specify whether a row should be returned.
All we have to go on is a confusingly vague description, to exclude "the first 14 rows", or "the first 14 days" for each symbol.
What we don't have is a representative sample of the data, or an example of what rows should be returned.
Without that, we don't have a way to know if we understand the description of the specification, and we don't have anything to test against or to compare our results to.
So, we are basically just guessing. (Which seems to be the most popular kind of answer provided by the "try this" enthusiasts.)
I can provide some examples of some patterns, which may suit your specification, or may not.
To get the earliest `histdate` for each `symbol`, and add 14 days to that, we can use an inline view. We can then join that to the `historical_data` table, to exclude rows whose `histdate` is on or before the date returned from the inline view.
(This is based on an assumption that the datatype of the `histdate` column is DATE.)
SELECT hd.*
FROM ( SELECT d.symbol
, MIN(d.histdate) + INTERVAL 14 DAY AS histdate
FROM historical_data d
GROUP BY d.symbol
) dd
JOIN historical_data hd
ON hd.symbol = dd.symbol
AND hd.histdate > dd.histdate
ORDER
BY hd.symbol
, hd.histdate
But that query doesn't include any reference to the `relative_strength_index` table. The original query includes a NOT EXISTS predicate, with a correlated subquery of the `relative_strength_index` table.
If the goal is get the earliest `rsi_date` for each `rsi_symbol` from that table, and then add 14 days to that value...
SELECT hd.*
FROM ( SELECT rsi.rsi_symbol
, MIN(rsi.rsi_date) + INTERVAL 14 DAY AS rsi_date
FROM relative_strength_index rsi
GROUP BY rsi.rsi_symbol
) rs
JOIN historical_data hd
ON hd.symbol = rs.rsi_symbol
AND hd.histdate > rs.rsi_date
ORDER
BY hd.symbol
, hd.histdate
If the goal is to exclude rows where a matching row in relative_strength_index already exists, I would use an anti-join pattern...
SELECT hd.*
FROM ( SELECT d.symbol
, MIN(d.histdate) + INTERVAL 14 DAY AS histdate
FROM historical_data d
GROUP BY d.symbol
) dd
JOIN historical_data hd
ON hd.symbol = dd.symbol
AND hd.histdate > dd.histdate
LEFT
JOIN relative_strength_index xr
ON xr.rsi_symbol = hd.symbol
AND xr.rsi_date = hd.histdate
WHERE xr.rsi_symbol IS NULL
ORDER
BY hd.symbol
, hd.histdate
These are just example query patterns, which are likely not suited to your exact specification, since they are guesses.
It doesn't make much sense to provide more examples of other patterns, without a more detailed specification.

Simplify CASE expression used multiple times

For readability, I would like to modify the statement below. Is there a way to extract the CASE expression so I can use it multiple times without having to write it out every time?
select
mturk_worker.notes,
worker_id,
count(worker_id) answers,
count(episode_has_accepted_imdb_url) scored,
sum( case when isnull(imdb_url) and isnull(accepted_imdb_url) then 1
when imdb_url = accepted_imdb_url then 1
else 0 end ) correct,
100 * ( sum( case when isnull(imdb_url) and isnull(accepted_imdb_url) then 1
when imdb_url = accepted_imdb_url then 1
else 0 end)
/ count(episode_has_accepted_imdb_url) ) percentage
from
mturk_completion
inner join mturk_worker using (worker_id)
where
timestamp > '2015-02-01'
group by
worker_id
order by
percentage desc,
correct desc
You can actually eliminate the CASE expressions. MySQL interprets boolean expressions as integers in a numeric context (with 1 being true and 0 being false):
select mturk_worker.notes, worker_id, count(worker_id) answers,
count(episode_has_accepted_imdb_url) scored,
sum(imdb_url = accepted_imdb_url or imdb_url is null and accepted_imdb_url is null) as correct,
(100 * sum(imdb_url = accepted_imdb_url or imdb_url is null and accepted_imdb_url is null) / count(episode_has_accepted_imdb_url)
) as percentage
from mturk_completion inner join
mturk_worker
using (worker_id)
where timestamp > '2015-02-01'
group by worker_id
order by percentage desc, correct desc;
If you like, you can simplify it further by using the null-safe equals operator:
select mturk_worker.notes, worker_id, count(worker_id) answers,
count(episode_has_accepted_imdb_url) scored,
sum(imdb_url <=> accepted_imdb_url) as correct,
(100 * sum(imdb_url <=> accepted_imdb_url) / count(episode_has_accepted_imdb_url)
) as percentage
from mturk_completion inner join
mturk_worker
using (worker_id)
where timestamp > '2015-02-01'
group by worker_id
order by percentage desc, correct desc;
This isn't standard SQL, but it is perfectly fine in MySQL.
Otherwise, you would need to use a subquery, and there is additional overhead in MySQL associated with subqueries.
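For reference, that subquery approach would compute the flag once in a derived table and aggregate it in the outer query. A sketch reusing the question's columns (the is_correct alias is mine):
select notes, worker_id, count(worker_id) answers,
       count(episode_has_accepted_imdb_url) scored,
       sum(is_correct) as correct,
       100 * sum(is_correct) / count(episode_has_accepted_imdb_url) as percentage
from (
      select mturk_worker.notes, worker_id, episode_has_accepted_imdb_url,
             case when imdb_url is null and accepted_imdb_url is null then 1
                  when imdb_url = accepted_imdb_url then 1
                  else 0
             end as is_correct
      from mturk_completion
      inner join mturk_worker using (worker_id)
      where timestamp > '2015-02-01'
     ) t
group by worker_id
order by percentage desc, correct desc;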

SQL select aggregate values in columns

I have a table in this structure:
editor_id
rev_user
rev_year
rev_month
rev_page
edit_count
here is the sqlFiddle: http://sqlfiddle.com/#!2/8cbb1/1
I need to surface the 5 most active editors during March 2011, for example; i.e. for each rev_user, sum all of the edit_count across all rev_pages for the given rev_month and rev_year.
Any suggestions how to do it?
UPDATE -
updated fiddle with demo data
You should be able to do it like this:
Select the total using SUM and GROUP BY, filtering by rev_year and rev_month
Order by the SUM in descending order
Limit the results to the top five items
Here is how:
SELECT * FROM (
SELECT rev_user, SUM(edit_count) AS total_edits
FROM edit_count_user_date
WHERE rev_year='2006' AND rev_month='09'
GROUP BY rev_user
) x
ORDER BY total_edits DESC
LIMIT 5
Demo on sqlfiddle.
Surely this is as straightforward as:
SELECT rev_user, SUM(edit_count) as TotalEdits
FROM edit_count_user_date
WHERE rev_month = 'March' and rev_year = '2014'
GROUP BY rev_user
ORDER BY TotalEdits DESC
LIMIT 5;
SqlFiddle here
May I also suggest using a more appropriate DATE type for the year and month storage?
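As an illustration of that suggestion, with a single DATE column (here called rev_date, which is an assumption) replacing rev_year/rev_month, the "top 5 for March 2011" query becomes a plain range filter:
SELECT rev_user, SUM(edit_count) AS TotalEdits
FROM edit_count_user_date
WHERE rev_date >= '2011-03-01' AND rev_date < '2011-04-01'
GROUP BY rev_user
ORDER BY TotalEdits DESC
LIMIT 5;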
Edit, re new Info
The query below returns all edits for the given month for the editor with the highest MonthTotal, and then re-groups those totals by rev_page.
SELECT e.rev_user, e.rev_page, SUM(e.edit_count) as TotalEdits
FROM edit_count_user_date e
INNER JOIN
(
SELECT rev_user, rev_year, rev_month, SUM(edit_count) AS MonthTotal
FROM edit_count_user_date
WHERE rev_month = '09' and rev_year = '2010'
GROUP BY rev_user, rev_year, rev_month
ORDER BY MonthTotal DESC
LIMIT 1
) as x
ON e.rev_user = x.rev_user AND e.rev_month = x.rev_month AND e.rev_year = x.rev_year
GROUP BY e.rev_user, e.rev_page;
SqlFiddle here - I've adjusted the data to make it more interesting.
However, if you need to do this across several months at a time, it will be more difficult, given that MySQL (before 8.0) lacks PARTITION BY / analytic window functions.
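On MySQL 8.0+ (or any engine with window functions), the per-month ranking can be expressed directly. A sketch against the same table; the month_total, rnk, and ranked aliases are mine:
-- Editors ranked by total edits within each month; keep the top one per month
SELECT rev_user, rev_year, rev_month, month_total
FROM (
  SELECT rev_user, rev_year, rev_month,
         SUM(edit_count) AS month_total,
         RANK() OVER (PARTITION BY rev_year, rev_month
                      ORDER BY SUM(edit_count) DESC) AS rnk
  FROM edit_count_user_date
  GROUP BY rev_user, rev_year, rev_month
) ranked
WHERE rnk = 1;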

need the work around with the mysql query

Folks
When I'm running the query below, I'm getting an "Invalid use of group function" error:
SELECT `Margin`.`destination`,
ROUND(sum(duration),2) as total_duration,
sum(calls) as total_calls
FROM `ilax`.`margins` AS `Margin`
WHERE `date1` = '2013-08-30' and `destination` like "af%"
AND ROUND(sum(duration),2) like "3%"
group by `destination`
ORDER BY duration Asc LIMIT 0, 20;
Let me know the workaround.
The WHERE clause is evaluated before grouping takes place, so SUM() cannot be used therein; use the HAVING clause instead, which is evaluated after grouping:
SELECT destination,
ROUND(SUM(duration), 2) AS total_duration,
SUM(calls) AS total_calls
FROM ilax.margins
WHERE date1 = '2013-08-30'
AND destination LIKE 'af%'
GROUP BY destination
HAVING total_duration LIKE '3%'
ORDER BY total_duration ASC
LIMIT 0, 20
Note also that one really ought to use numeric comparison operations for numeric values, rather than string pattern matching. For example:
HAVING total_duration >= 3000 AND total_duration < 4000
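For instance, the full query with a numeric range in place of the pattern match might look like this (a sketch; the 3000-4000 band is just one of the ranges that LIKE '3%' would have matched, so adjust it to whatever was actually intended):
SELECT destination,
       ROUND(SUM(duration), 2) AS total_duration,
       SUM(calls) AS total_calls
FROM ilax.margins
WHERE date1 = '2013-08-30'
  AND destination LIKE 'af%'
GROUP BY destination
HAVING total_duration >= 3000 AND total_duration < 4000
ORDER BY total_duration ASC
LIMIT 0, 20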