Pyspark subquery on the same table

I have some data with the columns year, month, date, and column_x. column_x may or may not be missing (null). What I want to compute is the missing rate of column_x. To do that, I'm trying to create two columns: total_count, the total number of rows, and count, the number of rows where column_x is null.
I'm trying to produce something like this:
total_count | count | year | month | date
60 | 20 | 2022 | 12 | 01
so that I can later compute count / total_count to get a percentage.
However, I'm not sure how to write the query.
I tried subqueries, but they throw an error. How can I achieve this with PySpark or SQL subqueries? (I can also register a temp table and run SQL queries.)

What I want to generate is the missing rate of column_x.
You can do conditional counts. In MySQL:
select year, month, day,
count(*) as cnt_total,
count(column_x) as cnt_x_not_null,
sum(column_x is null) as cnt_x_null,
avg(column_x is null) as ratio_x_null
from mytable
group by year, month, day
The last expression (avg) gives you the ratio of rows where the column is null. This works because MySQL evaluates conditions as 0/1 in numeric context, so we can just use avg on top of the is null predicate.
The other columns in the result set show more examples of conditional counts.
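To try the conditional-count idea without a MySQL server, here is a minimal sketch using Python's built-in sqlite3 module. It assumes SQLite behaves like MySQL in evaluating `column_x IS NULL` as 0/1; the sample rows are made up for illustration. Engines that do not treat booleans as numbers would likely need a CASE expression or a cast instead.

```python
import sqlite3

# In-memory database with a few made-up rows (hypothetical sample data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (year INT, month INT, day INT, column_x TEXT)")
rows = [
    (2022, 12, 1, "a"),
    (2022, 12, 1, None),
    (2022, 12, 1, None),
    (2022, 12, 2, "b"),
]
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?)", rows)

# Same shape as the query above: IS NULL yields 0 or 1, so SUM counts the
# nulls and AVG gives the fraction of null rows per group.
query = """
SELECT year, month, day,
       COUNT(*)              AS cnt_total,
       SUM(column_x IS NULL) AS cnt_x_null,
       AVG(column_x IS NULL) AS ratio_x_null
FROM mytable
GROUP BY year, month, day
ORDER BY year, month, day
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

No subquery is needed: the total and the null count come from the same pass over the table.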

Related

Need help trying to rank grouped values by month

I have a dataset that looks like this (there are other columns, but these are the relevant ones):
**Month | Revenue | Segment**
01/01/2016 | $40,000 | Seg XYZ
I'm trying to assign a rank to each segment based on revenue by month. So if there is a second row for January of $30,000 for a different segment, I would have a rank of 1 for the first row, and a 2 for the second row. This will repeat across segments and months.
This is my query:
Select
a.*
,case when @d=month then @r:=@r+1
else @r:=1 end as rank
,@d:=month as stuff
from
(select month
,sum(Revenue) as Revenue
,segment from final_audience_usage_file
group by month, segment) a
join
(select @r:=0,@d:=0)d
order by month, revenue desc
I keep getting this error: Data truncated: Incorrect date value: '0' for column 'month' at row 1
I checked the data, and there are no zeros, but since the error points at the first row, I suspect that something I'm doing is turning the value into zero. Any thoughts?

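You have to change the value that the variable @d is initialized to. The CASE expression compares @d with the date column month, so an initial value of 0 gets interpreted as a date, which fails. So instead of:
(select @r:=0,@d:=0)
use:
(select @r:=0,@d:='1900-01-01')
My guess is that the date '1900-01-01' is not used in your table. This way the comparison in the CASE expression can be evaluated without any problems, and the first row still starts at rank 1.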
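The running-rank logic behind those variables can be sketched in plain Python. This is only an illustration of the technique, not the asker's actual data: the rows and segment names are hypothetical, and `prev_month` plays the role of the date variable, initialized to a sentinel of the same type as the column.

```python
from datetime import date

# Hypothetical rows: (month, revenue, segment), already summed per month and segment.
rows = [
    (date(2016, 1, 1), 40000, "Seg XYZ"),
    (date(2016, 1, 1), 30000, "Seg ABC"),
    (date(2016, 2, 1), 25000, "Seg ABC"),
    (date(2016, 2, 1), 50000, "Seg XYZ"),
]

# Equivalent of ORDER BY month, revenue DESC.
rows.sort(key=lambda r: (r[0], -r[1]))

prev_month = date(1900, 1, 1)  # sentinel date assumed not to occur in the data
rank = 0
ranked = []
for month, revenue, segment in rows:
    # Continue counting within the same month, restart at 1 on a new month.
    rank = rank + 1 if month == prev_month else 1
    prev_month = month
    ranked.append((month, segment, revenue, rank))

for row in ranked:
    print(row)
```

Because the sentinel never equals a real month, the first row always takes the "new month" branch and gets rank 1.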

MySQL Date in where clause

I have a table which contains date (Field Type: Date and Date Format: %Y-%m-%d) as a field. I need to select all the rows from the table for all the years whose date is not between Dec 3rd and Dec 24th.
The table also contains month and day as separate fields.
The result can be obtained by using the following query:
select * from mytable where date not in (select date from mytable where month=12 and day between 3 and 24);
But I'm trying to get the result with a single query like the one below, which returns no rows:
select * from mytable where date not between '%Y-12-03' and '%Y-12-24';
Can it be done in a single query like the above one?

SELECT *
FROM mytable
WHERE MONTH(`date`) <> 12
OR DAY(`date`) NOT BETWEEN 3 AND 24
;
This returns every row that meets the requirements. Because it applies functions to the date column, it cannot use an index on that column and will likely be slow on a large dataset, so someone may well have a faster way. It does return the data you require, though.
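Since the question says the table already stores month and day as separate fields, the same condition can be written against those columns directly. Here is a small sketch with Python's built-in sqlite3 module that checks it against the asker's NOT IN query; the sample dates are made up, and SQLite is only standing in for MySQL here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (date TEXT, month INT, day INT)")

# Hypothetical sample dates around the excluded Dec 3 to Dec 24 window.
sample = ["2020-12-02", "2020-12-03", "2020-12-24",
          "2020-12-25", "2021-06-10", "2021-12-15"]
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [(d, int(d[5:7]), int(d[8:10])) for d in sample],
)

# Single query: NOT (month = 12 AND day BETWEEN 3 AND 24), rewritten with OR.
single = conn.execute(
    "SELECT date FROM mytable "
    "WHERE month <> 12 OR day NOT BETWEEN 3 AND 24 ORDER BY date"
).fetchall()

# The asker's original subquery version, for comparison.
subquery = conn.execute(
    "SELECT date FROM mytable WHERE date NOT IN "
    "(SELECT date FROM mytable WHERE month = 12 AND day BETWEEN 3 AND 24) "
    "ORDER BY date"
).fetchall()

kept = [row[0] for row in single]
print(kept)
print(single == subquery)
```

Note that 2021-06-10 is kept even though its day falls between 3 and 24, because the month is not December.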

Stop query from skipping over null values

I have a query that shows me the number of calls per day for the last 14 days within my app.
The query:
SELECT count(id) as count, DATE(FROM_UNIXTIME(timestamp)) as date FROM calls GROUP BY DATE(FROM_UNIXTIME(timestamp)) DESC LIMIT 14
On days where there were 0 calls, this query does not show those days. Rather than skip those days, I'd like to have a 0 or NULL in that spot.
Any ideas for how I can achieve this? If you have any questions as to what I'm asking please let me know.
Thanks

I don't believe your query is "skipping over NULL values", as your title suggests. Rather, your data probably looks something like this:
id | timestamp
----+------------
1 | 2014-01-01
2 | 2014-01-02
3 | 2014-01-04
As a result, there are no rows that contain the missing date, so there are no rows to be counted. The answer is that you need to generate a list of all the dates you want and then do a LEFT or RIGHT JOIN to it.
Unfortunately, MySQL doesn't make this as easy as other databases. There doesn't seem to be an effective way of generating a list of anything inline. So you'll need some sort of table.
I think I would create a static table containing a set of integers to be subtracted from the current date. Then you can use this table to generate your list of dates inline and JOIN to it.
CREATE TABLE days_ago_list (days_ago INTEGER);
INSERT INTO days_ago_list VALUES
(0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13)
;
Then:
SELECT COUNT(id), list_date
FROM (SELECT SUBDATE(CURDATE(), days_ago) AS list_date FROM days_ago_list) dates_to_list
LEFT JOIN (SELECT id, DATE(FROM_UNIXTIME(timestamp)) call_date FROM calls) calls_with_date
ON calls_with_date.call_date = dates_to_list.list_date
GROUP BY list_date
It is very important that you group by list_date; call_date will be NULL for any days without calls. It is also important to COUNT on id since NULL ids will not be counted. (That ensures you get a correct count of 0 for days with no calls.) If you need to change the dates listed, you simply update the table containing the integer list.
Alternatively, if this is for a web application, you could generate the list of dates code side and match up the counts with the dates after the query is done. This would make your web app logic somewhat more complicated, but it would also simplify the query and eliminate the need for the extra table.
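The code-side approach could look roughly like the sketch below. It assumes the query result has already been loaded into a dict mapping each date to its call count; the function name `fill_missing_days` and the sample data are made up for illustration.

```python
from datetime import date, timedelta


def fill_missing_days(counts, today, days=14):
    """Return (day, count) pairs for the last `days` days, newest first,
    using 0 for any day that is absent from `counts`."""
    result = []
    for n in range(days):
        day = today - timedelta(days=n)
        result.append((day, counts.get(day, 0)))
    return result


# Hypothetical query result: only days with at least one call are present.
query_result = {date(2014, 1, 1): 1, date(2014, 1, 2): 1, date(2014, 1, 4): 1}

filled = fill_missing_days(query_result, today=date(2014, 1, 4), days=4)
for day, count in filled:
    print(day, count)
```

In a real application `today` would usually be `date.today()`; it is passed in here so the example stays reproducible.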

Create a table that contains a row for each date you want in the results, then LEFT OUTER JOIN it with the results of your current query. Use the date from that table and the count from your query, substituting 0 when that count is NULL.

Counting all rows with specific columns and grouping by week

I've been trying now for some time to create a query that would count all rows from a table per day that include a column with certain id, and then group them to weekly values based on the UNIX timestamp column. I have a medium sized dataset with 37 million rows, and have been trying to run following kind of query:
SELECT DATE(timestamp), COUNT(*) FROM `table` WHERE ( date(timestamp)
between "YYYY-MM-DD" and "YYYY-MM-DD" and column_group_id=X )
group by week(date(startdate))
However, I'm getting weird results: the query doesn't group the counts correctly and shows values that are too large in the resulting count column (I verified the errors by querying very small, specific datasets).
If I group by date(startdate) instead, the row counts match on a per-day basis, but I'd like to combine these daily counts into weekly counts. How can this be done? The data is needed in this format:
2006-01-01 | 5
2006-01-08 | 10
so that the day timestamp is the first column and second is the amount of rows per week.

Your query is non-deterministic, so it is not surprising that you are getting unexpected results. By this I mean you could run this query on the same data 5 times and get 5 different result sets. This is because you are selecting DATE(timestamp) but grouping by WEEK(DATE(startdate)), so for each startdate week the query returns the timestamp of whichever row it comes across first, in ANY order.
Consider the following 2 rows (with timestamp in date format for ease of reading):
TimeStamp StartDate
20120601 20120601
20120701 20120601
Your query groups by WEEK(StartDate). Since both rows evaluate to the same week number, you would expect your results to have 1 row with a count of 2.
However, DATE(Timestamp) is also in the select list, and since it is neither aggregated nor part of the GROUP BY, the query has no way to decide which Timestamp to return, '20120601' or '20120701'. So even on this small data set you could get:
TimeStamp COUNT
20120601 2
or:
TimeStamp COUNT
20120701 2
If you add more data to the dataset as so:
TimeStamp StartDate
20120601 20120601
20120701 20120601
20120701 20120701
You could get
TimeStamp COUNT
20120601 2
20120701 1
or
TimeStamp COUNT
20120701 2
20120701 1
You can see how with 37,000,000 rows you will soon get results that you do not expect and cannot predict!
EDIT
Since it looks like you are trying to get the week start in your results while grouping by week, you could use the following to get the week start (replacing CURRENT_TIMESTAMP with whichever column you want):
SELECT DATE_ADD(CURRENT_TIMESTAMP, INTERVAL 1 - DAYOFWEEK(CURRENT_TIMESTAMP) DAY) AS WeekStart
You can then group by this date too to get weekly results and avoid the trouble of having things in your select list that aren't in your group by.
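To make the week-start arithmetic concrete, here is a small Python sketch of the same idea. It assumes weeks start on Sunday, which is what `1 - DAYOFWEEK(...)` produces since DAYOFWEEK returns 1 for Sunday; the helper name `week_start` and the sample dates are made up.

```python
from collections import Counter
from datetime import date, timedelta


def week_start(d):
    """Return the Sunday on or before `d`.

    date.weekday() is Monday=0 ... Sunday=6, so (weekday + 1) % 7 is the
    number of days since the most recent Sunday.
    """
    return d - timedelta(days=(d.weekday() + 1) % 7)


# Hypothetical dates: a Friday, a Saturday, and two Sundays.
dates = [date(2012, 6, 1), date(2012, 6, 2), date(2012, 6, 3), date(2012, 7, 1)]

# Group by the week start itself, so the selected value and the grouping
# key are the same thing and the result is deterministic.
weekly = Counter(week_start(d) for d in dates)
for start, count in sorted(weekly.items()):
    print(start, count)
```

Grouping by the full week-start date also keeps the same week number in different years apart, which grouping by WEEK() alone would not.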

Try this
SELECT DATE(timestamp), COUNT(week(date(startdate))) FROM `table` WHERE ( date(timestamp)
between "YYYY-MM-DD" and "YYYY-MM-DD" and column_group_id=X )
group by week(date(startdate))

Group results by period

I have some data which I want to retrieve, but I want to have it grouped by a specific number of seconds. For example if my table looks like this:
| id | user | pass | created |
The created column is INT and holds a timestamp (number of seconds from 1970).
I would want the number of users that are created between last month and the current date, but show them grouped by let's say 7*24*3600 (a week). So if in the range there are 1000 new users, have them show up how many registered each week (100 the first week, 450 the second, 50 the third and 400 the 4th week -- something like this).
I've tried grouping the results by created / 7*24*3600, but that's not working.
What should my query look like?

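You need to use integer division (div); otherwise the result is a real number and almost no two rows will resolve to the same value. Note also that created / 7*24*3600 is evaluated as (created / 7) * 24 * 3600, so the divisor needs parentheses.
SELECT
(created div (7*24*60*60)) as weeknumber
, count(*) as NewUserCount
FROM users
-- a column alias cannot be referenced in WHERE, so filter on created directly
WHERE created >= UNIX_TIMESTAMP(NOW() - INTERVAL 1 MONTH)
GROUP BY weeknumber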
See: http://dev.mysql.com/doc/refman/5.0/en/arithmetic-functions.html

You've got to keep the integer part only of that division. You can do it with the floor() function.
Have you tried select floor(created/604800) as week_no, count(*) from users group by floor(created/604800) ?
I assume you've got the "select users created in the last month" part sorted out.
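Both answers come down to the same bucketing step, which can be sketched in Python as follows. The timestamps are made up, and `//` stands in for MySQL's div (or floor of the division).

```python
from collections import Counter

WEEK_SECONDS = 7 * 24 * 60 * 60  # 604800

# Hypothetical `created` values, chosen to sit on the bucket boundaries.
created = [0, 604799, 604800, 1209599, 1209600]

# Integer division maps every timestamp in the same 7-day window to one key.
per_week = Counter(ts // WEEK_SECONDS for ts in created)
print(dict(per_week))

# The precedence pitfall from the question: without parentheses the
# expression divides by 7 and then multiplies, giving a huge float.
wrong = created[2] / 7 * 24 * 3600
print(wrong)
```

These buckets are aligned to the Unix epoch, which fell on a Thursday, so each "week" runs Thursday to Wednesday (UTC) rather than following the calendar week.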

Here are some options you can try:
GROUP BY DAY
select count(*), DATE_FORMAT(created_at,"%Y-%m-%d") as created_day FROM widgets GROUP BY created_day
GROUP BY MONTH
select count(*), DATE_FORMAT(created_at,"%Y-%m") as created_month FROM widgets GROUP BY created_month
GROUP BY YEAR
select count(*), DATE_FORMAT(created_at,"%Y") as created_year FROM widgets GROUP BY created_year
