Need help trying to rank grouped values by month - mysql

I have a dataset that looks like this (there are other columns, but these are the relevant ones):
**Month | Revenue | Segment**
01/01/2016 | $40,000 | Seg XYZ
I'm trying to assign a rank to each segment based on revenue by month. So if there is a second row for January of $30,000 for a different segment, I would have a rank of 1 for the first row, and a 2 for the second row. This will repeat across segments and months.
This is my query:
Select
a.*
,case when @d=month then @r:=@r+1
else @r:=1 end as rank
,@d:=month as stuff
from
(select month
,sum(Revenue) as Revenue
,segment from final_audience_usage_file
group by month, segment) a
join
(select @r:=0,@d:=0)d
order by month, revenue desc
I keep getting this error: Data truncated: Incorrect date value: '0' for column 'month' at row 1
I checked the data, and there are no zeros, but the fact that the error is reported for the first row leads me to believe that something I'm doing is changing the value to zero. Any thoughts?

You have to change the value that variable @d is initialized to. So instead of:
(select @r:=0,@d:=0)
use:
(select @r:=0,@d:='1900-01-01')
My guess is that date '1900-01-01' is not used in your table. This way the comparison of the CASE expression can be executed without any problems.
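On MySQL 8.0 or later, a window function is an alternative that avoids user variables altogether. The sketch below is not the query from the answer; it runs the same per-month ranking against an in-memory SQLite database (SQLite 3.25+ assumed, since it shares the RANK() OVER syntax). The sample rows and the alias revenue_rank are made up for illustration; the table and column names follow the question.

```python
import sqlite3

# Hypothetical sample rows; table and column names follow the question.
rows = [
    ("2016-01-01", 40000, "Seg XYZ"),
    ("2016-01-01", 30000, "Seg ABC"),
    ("2016-01-01", 5000, "Seg ABC"),
    ("2016-02-01", 10000, "Seg XYZ"),
    ("2016-02-01", 25000, "Seg ABC"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE final_audience_usage_file (month TEXT, revenue INTEGER, segment TEXT)"
)
conn.executemany("INSERT INTO final_audience_usage_file VALUES (?, ?, ?)", rows)

# Aggregate revenue per (month, segment) first, then rank within each month.
ranked = conn.execute("""
    SELECT month, segment, revenue,
           RANK() OVER (PARTITION BY month ORDER BY revenue DESC) AS revenue_rank
    FROM (
        SELECT month, segment, SUM(revenue) AS revenue
        FROM final_audience_usage_file
        GROUP BY month, segment
    ) AS a
    ORDER BY month, revenue DESC
""").fetchall()

for row in ranked:
    print(row)
```

The ranking restarts at 1 for every month because of PARTITION BY month, which is what the @d comparison emulates in the variable-based query.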

Related

Pyspark subquery on the same table

I have some data with the columns year, month, date, and column_x. column_x may or may not be missing. What I want to generate is the missing rate of column_x. To do so, I'm trying to create two columns: total_count, the total number of rows, and count, the number of rows where column_x is null.
I'm trying to create something like below:
total_count | count | year | month | date
60 | 20 | 2022 | 12 | 01
so that I can later compute count / total_count to get a percentage.
However, I'm not sure how to write the query.
I tried subqueries, but they throw an error. How can I achieve this through PySpark or SQL subqueries? (I can register a temp table and run SQL queries as well.)
What I want to generate is the missing rate of column_x.
You can do conditional counts. In MySQL:
select year, month, day,
count(*) as cnt_total,
count(column_x) as cnt_x_not_null,
sum(column_x is null) as cnt_x_null,
avg(column_x is null) as ratio_x_null
from mytable
group by year, month, day
The last expression (avg) gives you the ratio of rows where the column is null. This works because MySQL evaluates conditions as 0/1 in numeric context, so we can just use avg on top of the is null predicate.
Other columns in the resultset give more examples of conditional counts.
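Not every engine evaluates a predicate as 0/1 the way MySQL does, so a more portable way to write the same conditional counts is with CASE expressions. The sketch below is only an illustration: it uses an in-memory SQLite database with made-up rows, and the table name mytable and the column names are taken from the answer above. The same CASE form should carry over to other SQL dialects, though that is an assumption to verify against the engine in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (year INTEGER, month INTEGER, day INTEGER, column_x TEXT)"
)
# Hypothetical rows: two of the four rows on 2022-12-01 have a missing column_x.
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?)", [
    (2022, 12, 1, "a"),
    (2022, 12, 1, None),
    (2022, 12, 1, "b"),
    (2022, 12, 1, None),
    (2022, 12, 2, "c"),
])

# CASE turns the NULL test into an explicit 1/0 that SUM and AVG can consume.
missing = conn.execute("""
    SELECT year, month, day,
           COUNT(*) AS cnt_total,
           SUM(CASE WHEN column_x IS NULL THEN 1 ELSE 0 END) AS cnt_x_null,
           AVG(CASE WHEN column_x IS NULL THEN 1.0 ELSE 0.0 END) AS ratio_x_null
    FROM mytable
    GROUP BY year, month, day
    ORDER BY year, month, day
""").fetchall()

for row in missing:
    print(row)
```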

How can i find minimum count in a given date range

I'm trying to allocate workers for a job for a specific date range and wanted to find out the minimum number of workers allocated for the given date range.
For example, my table contains
startDate endDate No.of.Workers
--------- --------- ---------------
1-1-2019 10-1-2019 1
11-1-2019 20-1-2019 1
Now I want to find the minimum number of workers working in the date range 1-1-2019 to 20-1-2019.
The output should be 1.
Suppose my table looks like,
startDate endDate No.of.Workers
--------- --------- ---------------
1-1-2019 10-1-2019 1
11-1-2019 20-1-2019 1
11-1-2019 15-1-2019 1
The output should be 2.
Is there a query for this in SQL, or do I need to write an algorithm?
I am using mysql database.
You can get the number of workers needed by splitting the data, aggregating and using cumulative sums:
with dtes as (
select startDate as dte, numworks
from t
union all
select endDate as dte, - numworks
from t
)
select dte, sum(numworks),
sum(sum(numworks)) over (order by dte) as needed
from dtes
group by dte
order by dte;
To get the maximum, keep the same dtes CTE in front and do something like this:
select dte, sum(numworks),
sum(sum(numworks)) over (order by dte) as needed
from dtes
group by dte
order by needed desc
fetch first 1 row only;
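You don't specify the database; fetch first is ISO/ANSI standard SQL. MySQL does not support it, so use limit 1 there instead (the CTE and the window function also require MySQL 8.0 or later).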
Also, it is not clear if the end date counts as one of the days. This can affect the results. If it is included, then you need to add one day to the "endDate" part of the logic. How you do that depends on your database.
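To make the split-and-accumulate idea concrete, here is a minimal sketch run against an in-memory SQLite database (3.25+ assumed for window functions), using the three rows from the question. It assumes the end date is inclusive, so one day is added to endDate with SQLite's date() function; the table name t and the column numworks follow the answer, while the intermediate deltas CTE is an addition for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (startDate TEXT, endDate TEXT, numworks INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", [
    ("2019-01-01", "2019-01-10", 1),
    ("2019-01-11", "2019-01-20", 1),
    ("2019-01-11", "2019-01-15", 1),
])

timeline = conn.execute("""
    WITH dtes AS (
        -- +numworks when an allocation starts
        SELECT startDate AS dte, numworks FROM t
        UNION ALL
        -- -numworks the day after it ends (end date assumed inclusive)
        SELECT date(endDate, '+1 day') AS dte, -numworks FROM t
    ),
    deltas AS (
        SELECT dte, SUM(numworks) AS delta FROM dtes GROUP BY dte
    )
    -- running total of the deltas = workers allocated from that date on
    SELECT dte, SUM(delta) OVER (ORDER BY dte) AS needed
    FROM deltas
    ORDER BY dte
""").fetchall()

peak = max(needed for _, needed in timeline)
print(timeline, peak)
```

With these rows the running total reaches 2 on 2019-01-11, which matches the output the question expects for the second table.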

Get the max value of different sum values in sql

I have a table called "Sold_tickets" with attributes "Ticket_id" and "Date_sold". I want to find the day when the most tickets have been sold and the amount of tickets that were sold.
ticket_id date_sold
1 2017-02-15
2 2017-02-15
3 2017-02-14
In this case I want my output to look like this:
date_sold amount
2017-02-15 2
I know you can use a query like this
SELECT Count(ticket_id)
FROM Sold_tickets
WHERE date_sold = '2017-02-15';
to get an output of 2. The same can of course be done for 2017-02-14 to get an output of 1. However, then I have to manually check all the dates and compare them myself. Does a function exist (in sqlite) that counts the tickets sold for all the dates and then shows you only the maximum value?
Try using a GROUP BY aggregation query, then retain only the record having the maximum number of sales.
SELECT date_sold, COUNT(*)
FROM Sold_tickets
GROUP BY date_sold
ORDER BY COUNT(*) DESC
LIMIT 1
This solution would work well assuming that you don't have two or more dates tied for the greatest number of sales, or, if there is a tie, that you don't mind choosing just one date group.
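If ties do matter, one possible variant keeps every date whose count equals the maximum instead of cutting off at one row. The sketch below runs in SQLite, which the question mentions; tickets 4 and 5 are hypothetical rows added to create a tie, and the alias cnt is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sold_tickets (ticket_id INTEGER, date_sold TEXT)")
conn.executemany("INSERT INTO Sold_tickets VALUES (?, ?)", [
    (1, "2017-02-15"),
    (2, "2017-02-15"),
    (3, "2017-02-14"),
    (4, "2017-02-13"),  # hypothetical extra rows that tie with 2017-02-15
    (5, "2017-02-13"),
])

# Keep every date whose ticket count equals the highest per-date count.
best_days = conn.execute("""
    SELECT date_sold, COUNT(*) AS amount
    FROM Sold_tickets
    GROUP BY date_sold
    HAVING COUNT(*) = (
        SELECT MAX(cnt)
        FROM (SELECT COUNT(*) AS cnt FROM Sold_tickets GROUP BY date_sold)
    )
    ORDER BY date_sold
""").fetchall()

print(best_days)
```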

Counting all rows with specific columns and grouping by week

I've been trying now for some time to create a query that would count all rows from a table per day that include a column with certain id, and then group them to weekly values based on the UNIX timestamp column. I have a medium sized dataset with 37 million rows, and have been trying to run following kind of query:
SELECT DATE(timestamp), COUNT(*) FROM `table` WHERE ( date(timestamp)
between "YYYY-MM-DD" and "YYYY-MM-DD" and column_group_id=X )
group by week(date(startdate))
However, I'm getting weird results: the query doesn't group the counts correctly and shows values that are too large in the resulting count column (I verified the errors by querying very small, specific datasets).
If I group by date(startdate) instead, the row counts match on a per-day basis, but I'd like to combine these daily row counts into weekly amounts. How could this be done? The data is needed in the format:
2006-01-01 | 5
2006-01-08 | 10
so that the day timestamp is the first column and second is the amount of rows per week.
Your query is non-deterministic, so it is not surprising that you are getting unexpected results. By this I mean you could run this query on the same data 5 times and get 5 different result sets. This is because you are selecting DATE(timestamp) but grouping by WEEK(DATE(startdate)); the query therefore returns the timestamp of whichever row it comes across first per startdate week, in ANY order.
Consider the following 2 rows (with timestamp in date format for ease of reading):
TimeStamp StartDate
20120601 20120601
20120701 20120601
Your query is grouping by WEEK(StartDate) which is 23, since both rows evaluate to the same value you would expect your results to have 1 row with a count of 2.
However, DATE(Timestamp) is also in the select list, and since there is no ORDER BY statement the query has no idea which Timestamp to return, '20120601' or '20120701'. So even on this small result set you have a 50:50 chance of getting:
TimeStamp COUNT
20120601 2
and a 50:50 chance of getting
TimeStamp COUNT
20120701 2
If you add more data to the dataset as so:
TimeStamp StartDate
20120601 20120601
20120701 20120601
20120701 20120701
You could get
TimeStamp COUNT
20120601 2
20120701 1
or
TimeStamp COUNT
20120701 2
20120701 1
You can see how with 37,000,000 rows you will soon get results that you do not expect and cannot predict!
EDIT
Since it looks like you are trying to get the week start in your results while grouping by week, you could use the following to get the week start (replacing CURRENT_TIMESTAMP with whichever column you want):
SELECT DATE_ADD(CURRENT_TIMESTAMP, INTERVAL 1 - DAYOFWEEK(CURRENT_TIMESTAMP) DAY) AS WeekStart
You can then group by this date too to get weekly results and avoid the trouble of having things in your select list that aren't in your group by.
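As an illustration of grouping by the week-start date itself, the sketch below uses an in-memory SQLite database, so the MySQL DATE_ADD/DAYOFWEEK expression is swapped for SQLite's date modifiers: '-6 days' followed by 'weekday 0' lands on the Sunday on or before the given day. The table name events, the helper unix_ts, and the sample rows are hypothetical; the timestamp and column_group_id columns follow the question.

```python
import sqlite3
from datetime import datetime, timezone


def unix_ts(year, month, day):
    """Hypothetical helper: Unix timestamp for noon UTC on the given day."""
    return int(datetime(year, month, day, 12, tzinfo=timezone.utc).timestamp())


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (timestamp INTEGER, column_group_id INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [
    (unix_ts(2006, 1, 1), 7),   # Sunday
    (unix_ts(2006, 1, 3), 7),   # Tuesday
    (unix_ts(2006, 1, 7), 7),   # Saturday
    (unix_ts(2006, 1, 8), 7),   # Sunday, next week
    (unix_ts(2006, 1, 10), 7),  # Tuesday, next week
    (unix_ts(2006, 1, 10), 9),  # other group id, filtered out
])

# The same expression is selected and grouped on, so the result is deterministic.
weekly = conn.execute("""
    SELECT date(timestamp, 'unixepoch', '-6 days', 'weekday 0') AS week_start,
           COUNT(*) AS row_count
    FROM events
    WHERE column_group_id = ?
    GROUP BY week_start
    ORDER BY week_start
""", (7,)).fetchall()

print(weekly)
```

Because the select list contains only the grouping expression and an aggregate, every run returns the same week-start date for each group.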
Try this
SELECT DATE(timestamp), COUNT(week(date(startdate))) FROM `table` WHERE ( date(timestamp)
between "YYYY-MM-DD" and "YYYY-MM-DD" and column_group_id=X )
group by week(date(startdate))

Group results by period

I have some data which I want to retrieve, but I want to have it grouped by a specific number of seconds. For example if my table looks like this:
| id | user | pass | created |
The created column is INT and holds a timestamp (number of seconds from 1970).
I would want the number of users that are created between last month and the current date, but show them grouped by let's say 7*24*3600 (a week). So if in the range there are 1000 new users, have them show up how many registered each week (100 the first week, 450 the second, 50 the third and 400 the 4th week -- something like this).
I've tried grouping the results by created / 7*24*3600, but that's not working.
How should my query look like?
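You need to use integer division (div); otherwise the result turns into a real number and none of the rows within a week resolve to the same value. Also note that created / 7*24*3600 is evaluated as (created / 7) * 24 * 3600, so the divisor needs parentheses. A column alias cannot be referenced in WHERE, so the filter on weeknumber goes in HAVING.
SELECT
(created div (7*24*60*60)) as weeknumber
, count(*) as NewUserCount
FROM users
GROUP BY weeknumber
HAVING weeknumber > 1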
See: http://dev.mysql.com/doc/refman/5.0/en/arithmetic-functions.html
You've got to keep the integer part only of that division. You can do it with the floor() function.
Have you tried select floor(created/604800) as week_no, count(*) from users group by floor(created/604800) ?
I assume you've got the "select users created in the last month" part sorted out.
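A small runnable sketch of the bucketing idea, using an in-memory SQLite database where `/` on two integers already truncates (playing the role of MySQL's div or floor()). The users table and created column follow the question; the base offset and the sample timestamps are arbitrary values chosen for illustration.

```python
import sqlite3

WEEK = 7 * 24 * 60 * 60  # 604800 seconds
BASE = 2800 * WEEK       # arbitrary starting point for the made-up timestamps

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, created INTEGER)")
conn.executemany("INSERT INTO users (created) VALUES (?)", [
    (BASE + 10,),
    (BASE + 500000,),
    (BASE + WEEK + 1,),
    (BASE + 3 * WEEK + 5,),
    (BASE + 3 * WEEK + 600000,),
    (BASE + 3 * WEEK + 604799,),
])

# Integer division maps every timestamp in the same 7-day window to one bucket.
per_week = conn.execute("""
    SELECT created / (7*24*60*60) AS weeknumber, COUNT(*) AS NewUserCount
    FROM users
    GROUP BY weeknumber
    ORDER BY weeknumber
""").fetchall()

# Without parentheses the expression is (x / 7) * 24 * 60 * 60, not x / 604800.
unparenthesized = conn.execute("SELECT 604800 / 7*24*60*60").fetchone()[0]

print(per_week, unparenthesized)
```

Weeks without any new users simply do not appear in the result; filling those gaps would need a separate list of week numbers to join against.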
Okay here are the possible options you may try:
GROUP BY DAY
select count(*), DATE_FORMAT(created_at,"%Y-%m-%d") as created_day FROM widgets GROUP BY created_day
GROUP BY MONTH
select count(*), DATE_FORMAT(created_at,"%Y-%m") as created_month FROM widgets GROUP BY created_month
GROUP BY YEAR
select count(*), DATE_FORMAT(created_at,"%Y") as created_year FROM widgets GROUP BY created_year