MySQL: Aggregate weekly statistics with sub-statistics (subqueries)

I would like to gather weekly statistics on a MySQL table.
The table itself has the following structure:
user_id | action_id | created
0 | 123 | 2017-01-01 00:00:00
0 | 124 | ...
1 | 123 | ...
... | ... | ...
I would like to aggregate the following weekly statistics:
How many users were active per week
This is rather simple:
SELECT
    YEARWEEK(created) AS week,
    COUNT(DISTINCT user_id) AS count
FROM data
GROUP BY YEARWEEK(created);
Additionally I could apply a sorting.
The result looks like:
week count
201701 2
201702 3
How many users were active per week for the very first time
I thought about solving it with a subquery:
SELECT
    YEARWEEK(created) AS week,
    COUNT(DISTINCT user_id) AS count,
    (
        SELECT
            COUNT(DISTINCT d2.user_id)
        FROM data d2
        WHERE YEARWEEK(d2.created) = week
          AND NOT EXISTS (SELECT 1 FROM data d3
                          WHERE YEARWEEK(d3.created) < week AND d2.user_id = d3.user_id)
    ) AS countNewUsers
FROM data d1
GROUP BY YEARWEEK(created);
How many junior users were active per week
Junior users were active between 1 and 10 times before the week in question.
This is similar to the query above, but with a different subquery.
How many power users were active per week
Power users were active more than 10 times before the week in question.
This works as expected but performs poorly, since the subquery is evaluated before the grouping happens. With millions of rows in the table, this takes ages.
Does anybody have a better solution for this query, ideally returning all values in single result set?

I think all of your queries could be derived from one 'intermediate' table. It would contain (yearweek, userid, count), i.e. one row per user per week with the number of actions in that week.
Users active per week: Pretty much the same query as yours, but faster when run against this table.
Active for the first time: Self-join ON userid and compare the desired week with MIN(yearweek).
Activity before the target week: ... SUM(count) WHERE ... < week GROUP BY userid
Use the sum above to determine which userids are Junior and which are Power.
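The steps above can be sketched end to end. This is only a rough illustration, written in Python with the built-in sqlite3 module as a stand-in for MySQL: the table name weekly, the sample rows, and the use of strftime('%Y%W', ...) in place of YEARWEEK() are assumptions made for the sketch (SQLite has no YEARWEEK, and its week numbering differs from MySQL's).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data (user_id INTEGER, action_id INTEGER, created TEXT);
INSERT INTO data VALUES
  (0, 123, '2017-01-02 00:00:00'),
  (0, 124, '2017-01-03 00:00:00'),
  (1, 123, '2017-01-04 00:00:00'),
  (0, 125, '2017-01-09 00:00:00'),
  (2, 123, '2017-01-10 00:00:00');

-- Hypothetical intermediate table: one row per (week, user) with the
-- number of actions that user performed in that week.
CREATE TABLE weekly AS
SELECT strftime('%Y%W', created) AS week, user_id, COUNT(*) AS cnt
FROM data
GROUP BY week, user_id;
""")

# For every (week, user) row, sum the user's actions in earlier weeks,
# then classify the user by that prior activity and count per week.
rows = conn.execute("""
SELECT week,
       COUNT(*)                    AS active_users,
       SUM(prior = 0)              AS new_users,
       SUM(prior BETWEEN 1 AND 10) AS junior_users,
       SUM(prior > 10)             AS power_users
FROM (
    SELECT w.week, w.user_id,
           (SELECT COALESCE(SUM(p.cnt), 0)
            FROM weekly AS p
            WHERE p.user_id = w.user_id AND p.week < w.week) AS prior
    FROM weekly AS w
) AS t
GROUP BY week
ORDER BY week
""").fetchall()

for row in rows:
    print(row)
```

The correlated subquery still exists, but it now runs against the much smaller per-week table rather than the raw rows. In MySQL an index on (userid, yearweek) on the intermediate table would likely help further.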

Related

Get amount of active user of the last n days grouped by date

Suppose I have a Hive table logins with the following columns:
user_id | login_timestamp
I'm now interested in getting some activity KPIs. For instance, daily active user:
SELECT
    to_date(login_timestamp) AS date,
    COUNT(DISTINCT user_id) daily_active_user
FROM
    logins
GROUP BY to_date(login_timestamp)
ORDER BY date ASC
Changing it from daily active to weekly/monthly active is not a big deal, because I can just swap the to_date() function for one that returns the month and then group by that value.
What I now want is the number of distinct users who were active in the last n days (e.g. 3), grouped by date. I'm looking for a solution that works as a moving time window for every date, not only for one day (getting the number of active users of the last 3 days on day x only would be easy).
The result is supposed to look something like this:
date, 3d_active_user
2017-12-01, 111
2017-12-02, 234
2017-12-03, 254
2017-12-04, 100
2017-12-05, 103
2017-12-06, 103
2017-12-07, 230
Building a workaround for the moving time window with a subquery in the SELECT list (e.g. select x, (select max(x) from x) as y from z) is not possible, because the Hive version I'm using does not support it.
I tried my luck with something like COUNT(DISTINCT IF(DATEDIFF(today,login_date)<=3,user_id,null)), but nothing I have tried so far works.
Do you have any idea on how to solve this issue?
Any help appreciated!
You can use the BETWEEN operator.
If you want the active users who logged in between a particular date and now:
SELECT to_date(login_timestamp) AS date, COUNT(DISTINCT user_id) daily_active_user
FROM logins
WHERE login_timestamp BETWEEN startDate_timeStamp AND now()
GROUP BY to_date(login_timestamp)
ORDER BY date ASC
If you want the users who logged in within a specific date range:
SELECT to_date(login_timestamp) AS date, COUNT(DISTINCT user_id) daily_active_user
FROM logins
WHERE login_timestamp BETWEEN to_date(startDate_timeStamp) AND to_date(endDate_timeStamp)
GROUP BY to_date(login_timestamp)
ORDER BY date ASC
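The queries above still count users per single day. For the moving window the question asks about, one possible approach (a different technique from the BETWEEN filter above) is to join each distinct login date to all logins that fall in the n days ending on that date. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive; the sample rows and the N_DAYS name are assumptions, and SQLite's date() replaces to_date().

```python
import sqlite3

N_DAYS = 3  # hypothetical window size, counting the day itself

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logins (user_id INTEGER, login_timestamp TEXT);
INSERT INTO logins VALUES
  (1, '2017-12-01 08:00:00'),
  (2, '2017-12-01 09:00:00'),
  (1, '2017-12-02 10:00:00'),
  (3, '2017-12-03 11:00:00'),
  (3, '2017-12-04 12:00:00'),
  (2, '2017-12-05 13:00:00');
""")

# Each distinct login date d is matched with every login whose date lies
# in [d - (N_DAYS - 1) days, d]; distinct users are then counted per d.
query = """
SELECT d.day, COUNT(DISTINCT l.user_id) AS active_users
FROM (SELECT DISTINCT date(login_timestamp) AS day FROM logins) AS d
JOIN logins AS l
  ON date(l.login_timestamp) BETWEEN date(d.day, ?) AND d.day
GROUP BY d.day
ORDER BY d.day
"""
rows = conn.execute(query, (f"-{N_DAYS - 1} days",)).fetchall()

for row in rows:
    print(row)
```

Only dates that have at least one login appear in the output. Older Hive versions may accept only equality conditions in ON; in that case the usual workaround is a cross join with the range condition moved into WHERE.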

SQL Get first row of each unique ID AND each row with that ID within x time after the first

I was wondering if it is possible to have a single SQL command that returns each unique userID from a table as well as the rows containing that userID within the next 24 hours.
So for example I may have a table structured like:
id | userID | action | date
with a bunch of rows with thousands of unique userIDs, dozens of different actions, and dates. I am basically interested in what actions each userID does within the first 24 hours, but this for all users.
So I should get maybe 10-15 different actions for each userID, and each userID will be signing up on different days, months, or even years so it's not just grabbing all actions over a specific 24 hour period.
If I understand your question correctly, you could use a query like this:
SELECT a1.*
FROM actions a1
INNER JOIN (SELECT userID, MIN(date) first_action
            FROM actions
            GROUP BY userID) a2
    ON a1.userID = a2.userID
   AND a1.date <= first_action + INTERVAL 24 HOUR
This query returns all actions that each user performs in the first 24 hours after their first action.
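To make the boundary behaviour concrete, here is a small runnable sketch of the same join, using Python's built-in sqlite3 module instead of MySQL. The sample rows are invented, and datetime(..., '+24 hours') is SQLite's counterpart to + INTERVAL 24 HOUR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE actions (id INTEGER PRIMARY KEY, userID INTEGER, action TEXT, date TEXT);
INSERT INTO actions (userID, action, date) VALUES
  (1, 'signup', '2020-01-01 10:00:00'),
  (1, 'post',   '2020-01-01 20:00:00'),
  (1, 'like',   '2020-01-02 10:00:01'),
  (2, 'signup', '2021-06-15 09:00:00'),
  (2, 'post',   '2021-06-16 09:00:00');
""")

# a2 holds each user's first action time; a1 rows are kept when they fall
# within 24 hours of that time (inclusive, as in the query above).
rows = conn.execute("""
SELECT a1.userID, a1.action
FROM actions AS a1
JOIN (SELECT userID, MIN(date) AS first_action
      FROM actions
      GROUP BY userID) AS a2
  ON a1.userID = a2.userID
 AND a1.date <= datetime(a2.first_action, '+24 hours')
ORDER BY a1.userID, a1.date
""").fetchall()

for row in rows:
    print(row)
```

User 1's 'like' action is one second past the 24-hour mark and is dropped, while user 2's 'post' lands exactly on the boundary and is kept because the comparison is <=.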

Number of Posts as per days in a month

There is a table Post in my database that contains posts from different users. I want to write an SQL query that returns, for a given month, the number of posts made on each day. How can I do that generically in one query? I could write a separate query for every day, but that is the worst-case scenario, so I would appreciate an expert's solution.
Thanks
Expected output:
(Query counts the number of posts for all the days in a respective month)
Day : Number of posts
1 : 20
2 : 25
3 : 10
4 : 17
.........................
30 : 6
Table Structure:
ID | postid | post | date
SELECT DAYOFMONTH(date) AS Day, COUNT(*) AS Number_of_posts
FROM Post
GROUP BY DAYOFMONTH(date)
Be aware that if the table contains data from several months, the counts will be wrong, because the same day number from different months is merged into one group.
In that case, group by the full date and select the date instead of the day of the month.
SELECT DAYOFMONTH(date), count(*) FROM Post
GROUP BY DAYOFMONTH(date)
ORDER BY DAYOFMONTH(date) ASC;
If you want to query for a specific month (say, February) then use this:
SELECT DAYOFMONTH(date), count(*) FROM Post
WHERE MONTH(date) = '2'
GROUP BY DAYOFMONTH(date)
ORDER BY DAYOFMONTH(date) ASC;
Note: Months are returned in number form where the MONTH() function is used.
EDIT: If you're looking to return counts for EVERY day in a given month, including days without posts, then I'd point you to the accepted answer to a similar question: How to get values for every day in a month
SELECT date, COUNT(id) AS number_of_posts FROM table_name GROUP BY date;
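The warning about mixing months can be checked with a small experiment. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL; the sample posts are invented, and strftime('%d', ...) and strftime('%Y-%m', ...) replace DAYOFMONTH() and the MONTH() filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Post (ID INTEGER PRIMARY KEY, postid INTEGER, post TEXT, date TEXT);
INSERT INTO Post (postid, post, date) VALUES
  (1, 'a', '2013-01-05'),
  (2, 'b', '2013-02-05'),
  (3, 'c', '2013-02-05'),
  (4, 'd', '2013-02-06');
""")

# Grouping by day of month alone merges January 5 and February 5.
merged = conn.execute("""
SELECT CAST(strftime('%d', date) AS INTEGER) AS day, COUNT(*)
FROM Post
GROUP BY day
ORDER BY day
""").fetchall()

# Restricting to one year and month first gives per-day counts for that month.
february = conn.execute("""
SELECT CAST(strftime('%d', date) AS INTEGER) AS day, COUNT(*)
FROM Post
WHERE strftime('%Y-%m', date) = '2013-02'
GROUP BY day
ORDER BY day
""").fetchall()

print(merged)
print(february)
```

Filtering on the year as well as the month matters once the table spans more than one year, since a month number alone would again merge several months.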

How to deal with counting items by date in MySQL when the count for a given date increment is 0?

I'm looking to make some bar graphs to count item sales by day, month, and year. The problem that I'm encountering is that my simple MySQL queries only return counts where there are values to count. It doesn't magically fill in dates where dates don't exist and item sales=0. This is causing me problems when trying to populate a table, for example, because all weeks in a given year aren't represented, only the weeks where items were sold are represented.
My tables and fields are as follows:
items table: account_id and item_id
// table keeping track of owners' items
items_purchased table: purchaser_account_id, item_id, purchase_date
// table keeping track of purchases by other users
calendar table: datefield
//table with all the dates incremented every day for many years
Here's the 1st query I was referring to above:
SELECT COUNT(*) AS item_sales, DATE(purchase_date) AS date
FROM items_purchased
JOIN items ON items_purchased.item_id = items.item_id
WHERE items.account_id = 125
GROUP BY DATE(purchase_date)
I've read that I should join a calendar table with the tables where the counting takes place. I've done that, but now I can't get the 1st query to play nicely with the 2nd query, because the join in the 1st query eliminates dates where item sales are 0 from the result.
Here's the 2nd query, which needs to be merged with the 1st query somehow to produce the results I'm looking for:
SELECT calendar.datefield AS date, IFNULL(SUM(purchaseyesno),0) AS item_sales
FROM items_purchased
JOIN items ON items_purchased.item_id = items.item_id
RIGHT JOIN calendar ON (DATE(items_purchased.purchase_date) = calendar.datefield)
WHERE (calendar.datefield BETWEEN (SELECT MIN(DATE(purchase_date)) FROM items_purchased)
                              AND (SELECT MAX(DATE(purchase_date)) FROM items_purchased))
GROUP BY date
// this lists the sales/day
// to make it per week, change the group by to this: GROUP BY week(date)
The problem with this 2nd query is that it doesn't count item_sales by account_id (the person trying to sell the item to the purchaser_account_id users). The 1st query does, but it is missing the dates where item sales=0. So yeah, frustrating.
Here's how I'd like the resulting data to look (NOTE: this is what account_id=125 has sold; other people may have different numbers during this time frame):
2012-01-01 1
2012-01-08 1
2012-01-15 0
2012-01-22 2
2012-01-29 0
Here's what the result of the 1st query currently looks like:
2012-01-01 1
2012-01-08 1
2012-01-22 2
If someone could provide some advice on this I would be hugely grateful.
I'm not quite sure about the problem you're getting as I don't know the actual tables and data they contain that generates those results (that would help a lot!). However, let's try something. Use this condition:
where (items.account_id = 125 or items.account_id is null) and (other-conditions)
Your first query is perfectly acceptable. The fact is that there is no data in the MySQL table for those dates, so there is nothing to group. This is fine. You can account for it in your code: if a date does not exist in the result, there is no data to graph for it. To make that easier, order the result by date so you can loop through it and look for missing days.
Also, to avoid calling the DATE() function twice, you can change the GROUP BY to GROUP BY date (because your select list already contains DATE(purchase_date) as date).
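If the zero rows should come from the database rather than from application code, one possible way to merge the two queries is to filter purchases by seller inside a derived table and LEFT JOIN the calendar to it, so the account filter never removes calendar rows. The sketch below is only an illustration using Python's built-in sqlite3 module in place of MySQL; the sample rows and the alias names (c, s, p, i) are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (account_id INTEGER, item_id INTEGER);
CREATE TABLE items_purchased (purchaser_account_id INTEGER, item_id INTEGER, purchase_date TEXT);
CREATE TABLE calendar (datefield TEXT);
INSERT INTO items VALUES (125, 1), (999, 2);
INSERT INTO items_purchased VALUES
  (7, 1, '2012-01-01 09:00:00'),
  (8, 2, '2012-01-02 10:00:00'),
  (9, 1, '2012-01-03 11:00:00'),
  (9, 1, '2012-01-03 12:00:00');
INSERT INTO calendar VALUES
  ('2012-01-01'), ('2012-01-02'), ('2012-01-03'), ('2012-01-04');
""")

# The seller filter lives inside the derived table s, so calendar rows
# without a matching sale survive the LEFT JOIN and are counted as 0.
rows = conn.execute("""
SELECT c.datefield, COUNT(s.item_id) AS item_sales
FROM calendar AS c
LEFT JOIN (
    SELECT date(p.purchase_date) AS day, p.item_id
    FROM items_purchased AS p
    JOIN items AS i ON i.item_id = p.item_id
    WHERE i.account_id = ?
) AS s ON s.day = c.datefield
GROUP BY c.datefield
ORDER BY c.datefield
""", (125,)).fetchall()

for row in rows:
    print(row)
```

COUNT(s.item_id) counts only non-NULL values, which is what turns unmatched calendar dates into 0 instead of 1. A date-range condition on c.datefield can be added in WHERE without reintroducing the problem, because it filters the calendar side only.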

Group results by period

I have some data which I want to retrieve, but I want to have it grouped by a specific number of seconds. For example if my table looks like this:
| id | user | pass | created |
The created column is INT and holds a timestamp (number of seconds from 1970).
I would want the number of users that are created between last month and the current date, but show them grouped by let's say 7*24*3600 (a week). So if in the range there are 1000 new users, have them show up how many registered each week (100 the first week, 450 the second, 50 the third and 400 the 4th week -- something like this).
I've tried grouping the results by created / 7*24*3600, but that doesn't work.
What should my query look like?
You need to use integer division div otherwise the result will turn into a real and none of the weeks will resolve to the same value.
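SELECT
    (created DIV (7*24*60*60)) AS weeknumber,
    COUNT(*) AS NewUserCount
FROM users
-- A column alias cannot be referenced in WHERE, so filter on created itself;
-- the one-month lower bound is an assumption taken from the question.
WHERE created >= UNIX_TIMESTAMP(NOW() - INTERVAL 1 MONTH)
GROUP BY weeknumber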
See: http://dev.mysql.com/doc/refman/5.0/en/arithmetic-functions.html
You've got to keep the integer part only of that division. You can do it with the floor() function.
Have you tried select floor(created/604800) as week_no, count(*) from users group by floor(created/604800) ?
I assume you've got the "select users created in the last month" part sorted out.
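Both answers rely on the same arithmetic, which can be checked outside the database. The short Python sketch below uses invented timestamps and a hypothetical week_bucket helper; it shows why the original expression fails (without parentheses it divides by 7 and then multiplies) and how integer division puts every timestamp of the same 7-day span into one bucket.

```python
from collections import Counter

WEEK_SECONDS = 7 * 24 * 3600  # 604800


def week_bucket(created: int) -> int:
    """Return the index of the 7-day bucket a Unix timestamp falls into."""
    return created // WEEK_SECONDS  # integer division, like DIV or FLOOR(a / b)


created = 604800

# Without parentheses, division and multiplication are evaluated left to
# right: (created / 7) * 24 * 3600, which grows the value instead of
# shrinking it to a week number.
unparenthesized = created / 7 * 24 * 3600

# Timestamps on both edges of the first three buckets.
samples = [0, 604799, 604800, 1209599, 1209600]
counts = Counter(week_bucket(ts) for ts in samples)

print(unparenthesized)
print(dict(counts))
```

Note that these buckets start at the Unix epoch, which fell on a Thursday, so they do not line up with calendar weeks.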
Okay here are the possible options you may try:
GROUP BY DAY
select count(*), DATE_FORMAT(created_at,"%Y-%m-%d") as created_day FROM widgets GROUP BY created_day
GROUP BY MONTH
select count(*), DATE_FORMAT(created_at,"%Y-%m") as created_month FROM widgets GROUP BY created_month
GROUP BY YEAR
select count(*), DATE_FORMAT(created_at,"%Y") as created_year FROM widgets GROUP BY created_year