Looking for daily user activity from multiple tables - mysql

I want to find the Daily Active Users (DAU); how this is calculated differs from application to application. In my case, there are multiple tables in which a user could have had activity.
I've been able to do a LEFT JOIN on one of the tables, but I don't know how to incorporate the rest of the tables to get the activity that happened in the last 30 days.
SELECT
DATE_FORMAT(user_video_plays.created_at, '%Y-%m-%d') AS date,
count(*)
FROM
`users`
INNER JOIN `subscriptions` ON `users`.`id` = `subscriptions`.`user_id`
LEFT JOIN `user_video_plays` ON `users`.`id` = `user_video_plays`.`user_id`
WHERE
`users`.`deleted_at` IS NULL
AND `subscriptions`.`chargebee_status` <> 'cancelled'
AND `user_video_plays`.`created_at` BETWEEN '2022-10-01 00:00:00' AND '2022-10-31 23:59:59'
GROUP BY
DATE_FORMAT(user_video_plays.created_at, '%Y-%m-%d')
I have 2 more tables where the user could have activity: forum_posts and forum_post_replies. How can I incorporate them into my query so I get the activity grouped by day?
I've prepared a DB fiddle with the structure and some sample data, as well as my query: https://www.db-fiddle.com/f/ppRaWP7SPDURm8dePyAkEr/0
Thank you
UPDATE 1: Looking at @Luuk's answer, I realized that we also need to make the count unique per user. In the following fiddle I've simplified the data, but user_video_plays has 3 plays from the same user, and that should count as one, not 3: https://dbfiddle.uk/ZszSND-H - I think this is easy in my single-table query, with a unique count, but I need to take it into consideration across all 3 activity tables.

I have added forum_posts:
SELECT
DATE_FORMAT(user_video_plays.created_at, '%Y-%m-%d') AS date,
count(*) countUsers,
count(`user_video_plays`.`user_id`) videoPlays,
count(`forum_posts`.`user_id`) forumPosts
FROM
`users`
INNER JOIN `subscriptions` ON `users`.`id` = `subscriptions`.`user_id`
LEFT JOIN `user_video_plays` ON `users`.`id` = `user_video_plays`.`user_id`
AND `user_video_plays`.`created_at` BETWEEN '2022-10-01 00:00:00' AND '2022-10-31 23:59:59'
LEFT JOIN `forum_posts` ON `users`.`id` = `forum_posts`.`user_id`
AND `forum_posts`.`created_at` BETWEEN '2022-10-01 00:00:00' AND '2022-10-31 23:59:59'
WHERE
`users`.`deleted_at` IS NULL
AND `subscriptions`.`chargebee_status` <> 'cancelled'
GROUP BY
DATE_FORMAT(user_video_plays.created_at, '%Y-%m-%d')
NOTE: I moved AND user_video_plays.created_at BETWEEN .... from the WHERE-clause to the ON-clause of the LEFT JOIN.
for the output, see: DBFIDDLE
Can you do the other table yourself, following this example?
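Regarding the uniqueness requirement from UPDATE 1, here is a minimal sketch of one possible approach: stack the three activity tables with UNION ALL and count distinct users per day. It runs against an in-memory SQLite database purely so that it is self-contained; the sample rows are made up, and the joins to users and subscriptions from the question are left out for brevity. In MySQL the same SELECT should work with DATE() or DATE_FORMAT().

```python
import sqlite3

# Hypothetical miniature of the three activity tables from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_video_plays   (user_id INTEGER, created_at TEXT);
CREATE TABLE forum_posts        (user_id INTEGER, created_at TEXT);
CREATE TABLE forum_post_replies (user_id INTEGER, created_at TEXT);
INSERT INTO user_video_plays VALUES
  (1, '2022-10-01 08:00:00'), (1, '2022-10-01 09:00:00'),
  (1, '2022-10-01 10:00:00'), (2, '2022-10-02 12:00:00');
INSERT INTO forum_posts VALUES
  (1, '2022-10-01 11:00:00'), (3, '2022-10-02 13:00:00');
INSERT INTO forum_post_replies VALUES
  (2, '2022-10-02 14:00:00');
""")

# Stack all activity rows, then count each user at most once per day.
DAU_SQL = """
SELECT date(a.created_at)        AS day,
       COUNT(DISTINCT a.user_id) AS active_users
FROM (
    SELECT user_id, created_at FROM user_video_plays
    UNION ALL
    SELECT user_id, created_at FROM forum_posts
    UNION ALL
    SELECT user_id, created_at FROM forum_post_replies
) AS a
WHERE a.created_at BETWEEN '2022-10-01 00:00:00' AND '2022-10-31 23:59:59'
GROUP BY date(a.created_at)
ORDER BY day
"""
dau = conn.execute(DAU_SQL).fetchall()
print(dau)  # user 1's three plays plus one post on 2022-10-01 count as one user
```

Because the tables are stacked rather than joined to each other, a user with several plays and several posts on the same day does not get multiplied.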

Related

SQL Count on JOIN query is taking forever to execute?

I'm trying to run a count query on a 2-table join. The e_amazing_client table has a million rows and m_user has just 50 rows, BUT the count query is taking forever!
SELECT COUNT(`e`.`id`) AS `count`
FROM `e_amazing_client` AS `e`
LEFT JOIN `user` AS `u` ON `e`.`cx_hc_user_id` = `u`.`id`
WHERE ((`e`.`date_created` >= '2018-11-11') AND (`e`.`date_created` >= '2018-11-18')) AND (`e`.`id` >= 1)
I don't know what is wrong with this query?
First, I'm guessing that this is sufficient:
SELECT COUNT(*) AS `count`
FROM e_amazing_client e
WHERE e.date_created >= '2018-11-18' AND e.id >= 1;
If user has only 50 rows, I doubt it is creating duplicates. The two comparisons on date_created are redundant: both are >=, so only the later date ('2018-11-18') has any effect.
For this query, try creating an index on e_amazing_client(date_created, id).
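A small check of the date logic, as a sketch only: it uses an in-memory SQLite table with made-up rows (only the table and column names come from the question) to show that two >= comparisons collapse to the later bound, while a >= / <= pair gives an actual range.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE e_amazing_client (id INTEGER PRIMARY KEY, date_created TEXT)"
)
# Hypothetical sample rows, one per date.
conn.executemany(
    "INSERT INTO e_amazing_client (date_created) VALUES (?)",
    [("2018-11-10",), ("2018-11-12",), ("2018-11-15",),
     ("2018-11-18",), ("2018-11-20",)],
)

def count_where(condition):
    """Count rows of e_amazing_client matching the given WHERE condition."""
    sql = f"SELECT COUNT(*) FROM e_amazing_client e WHERE {condition}"
    return conn.execute(sql).fetchone()[0]

# The condition as posted: two lower bounds.
as_posted = count_where(
    "e.date_created >= '2018-11-11' AND e.date_created >= '2018-11-18'"
)
# Only the later lower bound.
later_only = count_where("e.date_created >= '2018-11-18'")
# A real range: lower bound and upper bound.
in_range = count_where(
    "e.date_created >= '2018-11-11' AND e.date_created <= '2018-11-18'"
)
print(as_posted, later_only, in_range)
```

Note that with a DATETIME column in MySQL, `<= '2018-11-18'` compares against midnight at the start of that day, so rows later on the 18th would need `< '2018-11-19'` instead.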
Maybe you wanted this:
SELECT COUNT(`e`.`id`) AS `count`
FROM `e_amazing_client` AS `e`
LEFT JOIN `user` AS `u` ON `e`.`cx_hc_user_id` = `u`.`id`
WHERE ((`e`.`date_created` >= '2018-11-11') AND (`e`.`date_created` <= '2018-11-18')) AND (`e`.`id` >= 1)
to check between dates?
Also, do you really need
AND (`e`.`id` >= 1)
If id is the usual kind of table id, is there any case where it would be < 1?
Your query is pulling ALL records in the date range, because the only other condition in your WHERE clause is id >= 1; there is no clause in there for a specific user. Your original query also had a second condition of >= 2018-11-18. You MAY have meant that you only wanted the count WITHIN the week 11/11 to 11/18, in which case the comparisons SHOULD have been >= 11-11 and <= 11-18.
As for the count, you are getting ALL people (assuming no entry has an ID less than 1) and thus a single count within that date range. If you want it per user, as you indicated, you need to group by the cx_hc_user_id (user) column to see who has the most, or make the user part of the WHERE clause to get one person.
SELECT
e.cx_hc_user_id,
count(*) countPerUser
from
e_amazing_client e
WHERE
e.date_created >= '2018-11-11'
AND e.date_created <= '2018-11-18'
group by
e.cx_hc_user_id
You can order by the count descending to get the user with the highest count, but I'm still not sure what you are asking.

MySQL funnel multiple ANDs

With the MySQL query below, I would like to match users who hit the page /signup and then, later in the user flow, /confirm.
SELECT COUNT(*) as `total` FROM (
SELECT COUNT(DISTINCT t.user_id) AS `visitors`
FROM `tracks` t
JOIN `user_details` u ON u.id=t.user_id AND u.site_id=t.site_id
WHERE t.site_id='334565'
AND (t.page = '/signup' AND t.page = '/confirm')
AND t.timestamp BETWEEN '2015-01-23 00:00:00' AND '2015-04-30 23:59:59'
GROUP BY t.user_id, t.track_id
) as a
The main problem with this query is that MySQL doesn't work the way I'm (incorrectly) trying to use it.
The other problem is that the matched pages could potentially occur in the wrong order, so they also need to be matched in the specified order.
Maybe this query needs to be done completely differently, but I'm not sure I'm on the right track.
Has anyone done this before or is there a better way to get the job done?
Please note that the above WHERE clause could match more than just page and could be anything such as t.referrer or u.somethingelse
Another example would be:
SELECT COUNT(*) as `total` FROM (
SELECT COUNT(DISTINCT t.user_id) AS `visitors`
FROM `tracks` t
JOIN `user_details` u ON u.id=t.user_id AND u.site_id=t.site_id
WHERE t.site_id='334565'
AND (u.browser = 'chrome' AND t.referrer_host = 'google.com' AND t.page = '/confirm' and t.page = '/preferences')
AND t.timestamp BETWEEN '2015-01-23 00:00:00' AND '2015-04-30 23:59:59'
GROUP BY t.user_id, t.track_id
) as a
Each of the u.browser, t.referrer_host, t.page are goals and I am trying to show them all together as a funnel. Kind of how an analytics program would do it.
I'm assuming this is tracking visitors to web pages (not a tough assumption to make), with each url / page endpoint having its own entry in the tracking table.
In order to find users who have hit both pages, you need to join the tracking table to itself. Something like this:
SELECT COUNT(DISTINCT t1.user_id) AS `visitors`
FROM `tracks` t1
JOIN `user_details` u ON u.id=t1.user_id AND u.site_id=t1.site_id
join `tracks` t2 on t1.site_id = t2.site_id and u.id = t2.user_id and t1.track_id <> t2.track_id
WHERE t1.site_id='334565'
AND (t1.page = '/signup' AND t2.page = '/confirm')
AND t1.timestamp BETWEEN '2015-01-23 00:00:00' AND '2015-04-30 23:59:59'
I don't think there's any need for grouping, as I think you just want the distinct number of visitors that have signed up, and then confirmed.
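The question also asks for the steps to happen in order. A minimal sketch of how the self-join could enforce that, assuming the timestamp column can be compared directly: add a `t2.timestamp > t1.timestamp` condition. The example runs on an in-memory SQLite table with made-up rows and skips the user_details join for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE tracks (
    track_id  INTEGER PRIMARY KEY,
    user_id   INTEGER,
    site_id   TEXT,
    page      TEXT,
    timestamp TEXT
)""")
# Hypothetical visits: user 1 signs up then confirms, user 2 hits the
# pages in the opposite order, user 3 never confirms.
conn.executemany(
    "INSERT INTO tracks (user_id, site_id, page, timestamp) VALUES (?, ?, ?, ?)",
    [
        (1, "334565", "/signup",  "2015-02-01 10:00:00"),
        (1, "334565", "/confirm", "2015-02-01 10:05:00"),
        (2, "334565", "/confirm", "2015-02-02 09:00:00"),
        (2, "334565", "/signup",  "2015-02-02 09:30:00"),
        (3, "334565", "/signup",  "2015-02-03 12:00:00"),
    ],
)

FUNNEL_SQL = """
SELECT COUNT(DISTINCT t1.user_id)
FROM tracks t1
JOIN tracks t2
  ON  t2.site_id = t1.site_id
  AND t2.user_id = t1.user_id
  {order_condition}
WHERE t1.site_id = '334565'
  AND t1.page = '/signup'
  AND t2.page = '/confirm'
  AND t1.timestamp BETWEEN '2015-01-23 00:00:00' AND '2015-04-30 23:59:59'
"""

# Any order: both pages were visited at some point.
any_order = conn.execute(
    FUNNEL_SQL.format(order_condition="")
).fetchone()[0]
# Funnel order: the /confirm hit must come after the /signup hit.
in_order = conn.execute(
    FUNNEL_SQL.format(order_condition="AND t2.timestamp > t1.timestamp")
).fetchone()[0]
print(any_order, in_order)
```

Each additional funnel step would be one more self-join with its own ordering condition against the previous step.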

Speed up MySql query time with multiple conditional joins

There are 3 tables: persontbl1 and persontbl2 (7500 rows each) and schedule (~3000 active schedules, i.e. schedule.status = 0). The person tables contain data for the same persons in a one-to-one relationship, and an INNER JOIN between the two takes less than a second. The schedule table contains data about persons to be interviewed, and not all persons have schedules in the schedule table. With the LEFT JOIN, the query takes around 45 seconds, which is causing all sorts of issues.
SELECT persontbl1._CREATION_DATE, persontbl2._TOP_LEVEL_AURI,
persontbl2.RESP_CNIC, persontbl2.RESP_CNIC_NAME,
persontbl1.MOB_NUMBER1, persontbl1.MOB_NUMBER2,
schedule.id, schedule.call_datetime, schedule.enum_id,
schedule.enum_change, schedule.status
FROM persontbl1
INNER JOIN persontbl2 ON (persontbl2._TOP_LEVEL_AURI = persontbl1._URI)
AND (AGR_CONTACT=1)
LEFT JOIN SCHEDULE ON (schedule.survey_id = persontbl1._URI)
AND (SCHEDULE.status=0)
AND (DATE(SCHEDULE.call_datetime) <= CURDATE())
ORDER BY schedule.call_datetime IS NULL DESC, persontbl1._CREATION_DATE ASC
Here is the explain for query:
Schedule Table structure:
Schedule Table indexes:
Please let me know if any further information is required.
Thanks.
Edit: Added fully qualified table names and their columns.
You should just replace this line:
AND (DATE(SCHEDULE.call_datetime) <= CURDATE())
with this one:
AND SCHEDULE.call_datetime <= '2015-04-18 00:00:00'
so MySQL will not call 2 functions for every record but will use the static constant '2015-04-18 00:00:00' instead.
So you can check whether performance improves if your query is:
SELECT persontbl1._CREATION_DATE, persontbl2._TOP_LEVEL_AURI,
persontbl2.RESP_CNIC, persontbl2.RESP_CNIC_NAME,
persontbl1.MOB_NUMBER1, persontbl1.MOB_NUMBER2,
schedule.id, schedule.call_datetime, schedule.enum_id,
schedule.enum_change, schedule.status
FROM persontbl1
INNER JOIN persontbl2 ON (persontbl2._TOP_LEVEL_AURI = persontbl1._URI)
AND (AGR_CONTACT=1)
LEFT JOIN SCHEDULE ON (schedule.survey_id = persontbl1._URI)
AND (SCHEDULE.status=0)
AND (SCHEDULE.call_datetime <= '2015-02-01 00:00:00')
ORDER BY schedule.call_datetime IS NULL DESC, persontbl1._CREATION_DATE ASC
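One caveat, shown as a sketch with made-up rows: a literal cutoff at midnight is not quite the same condition as DATE(call_datetime) <= CURDATE(), because it drops rows from later on the same day. A half-open comparison against the start of the next day keeps the original meaning without wrapping the column in a function; in MySQL that could be written as `call_datetime < CURDATE() + INTERVAL 1 DAY`. The demo below uses SQLite and a fixed, hypothetical "today" so it is repeatable.

```python
import sqlite3
from datetime import date, timedelta

# Fixed, hypothetical "current date" so the demo gives the same result every run.
today = date(2015, 4, 18)
tomorrow = today + timedelta(days=1)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schedule (id INTEGER, call_datetime TEXT)")
conn.executemany("INSERT INTO schedule VALUES (?, ?)", [
    (1, "2015-04-17 09:00:00"),
    (2, "2015-04-18 00:00:00"),
    (3, "2015-04-18 15:30:00"),
    (4, "2015-04-19 08:00:00"),
])

def ids(condition, param):
    """Return the schedule ids matching a one-parameter WHERE condition."""
    sql = f"SELECT id FROM schedule WHERE {condition} ORDER BY id"
    return [row[0] for row in conn.execute(sql, (param,))]

# Original form: function applied to the column for every row.
wrapped = ids("date(call_datetime) <= ?", today.isoformat())
# Same meaning, but the bare column is compared to a constant.
half_open = ids("call_datetime < ?", tomorrow.isoformat())
# Literal midnight cutoff: loses the 15:30 row on the same day.
midnight_literal = ids("call_datetime <= ?", today.isoformat() + " 00:00:00")
print(wrapped, half_open, midnight_literal)
```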
EDIT 1: You said that without the LEFT JOIN part it was fast enough, so you can then try:
SELECT persontbl1._CREATION_DATE, persontbl2._TOP_LEVEL_AURI,
persontbl2.RESP_CNIC, persontbl2.RESP_CNIC_NAME,
persontbl1.MOB_NUMBER1, persontbl1.MOB_NUMBER2,
s.id, s.call_datetime, s.enum_id,
s.enum_change, s.status
FROM persontbl1
INNER JOIN persontbl2 ON (persontbl2._TOP_LEVEL_AURI = persontbl1._URI)
AND (AGR_CONTACT=1)
LEFT JOIN
(SELECT *
FROM SCHEDULE
WHERE status=0
AND call_datetime <= '2015-02-01 00:00:00'
) s
ON s.survey_id = persontbl1._URI
ORDER BY s.call_datetime IS NULL DESC, persontbl1._CREATION_DATE ASC
I'm guessing that AGR_CONTACT comes from p1. This is the query you want to optimize:
SELECT p1._CREATION_DATE, _TOP_LEVEL_AURI, RESP_CNIC, RESP_CNIC_NAME,
MOB_NUMBER1, MOB_NUMBER2,
s.id, s.call_datetime, s.enum_id, s.enum_change, s.status
FROM persontbl1 p1 INNER JOIN
persontbl2 p2
ON (p2._TOP_LEVEL_AURI = p1._URI) AND (p1.AGR_CONTACT = 1) LEFT JOIN
SCHEDULE s
ON (s.survey_id = p1._URI) AND
(s.status = 0) AND
(DATE(s.call_datetime) <= CURDATE())
ORDER BY s.call_datetime IS NULL DESC, p1._CREATION_DATE ASC;
The best indexes for this query are: persontbl1(agr_contact), persontbl2(_TOP_LEVEL_AURI), and schedule(survey_id, status, call_datetime).
Wrapping the datetime column in date() is generally not recommended, because it precludes the use of an index on that column. In this case, however, the condition is in the ON clause of a LEFT JOIN, so it doesn't make a difference: the column is not being used to filter the result anyway. The index on schedule is only there to cover the ON clause.

How to group results by hour, which includes hours with no record

I have a table which contains all orders, and I'm trying to separate the orders by hour.
If there is no record for a specific hour, the query ignores that hour, but what I'm trying to achieve is to report '0' for that hour.
I also joined the table with a temporary table containing all hours.
SELECT sum(orders.price), hour(orders.time) as hour
FROM orders
RIGHT JOIN dummy_time as dummy
ON hour(orders.time) = dummy.time
WHERE state = 1
AND (date(orders.time) = '2014-06-17' or orders.time is null)
GROUP BY hour
You can view my query in SQLFiddle
To get all rows from dummy_time, move your conditions from your WHERE to your RIGHT JOIN. Also, select the hour from dummy.time so you will get all hours.
Use COALESCE to get values of 0 where an order doesn't have records.
SELECT COALESCE(sum(orders.price),0), dummy.time as hour
FROM orders
RIGHT JOIN dummy_time as dummy
ON hour(orders.time) = dummy.time
AND orders.state = 1
AND orders.time BETWEEN '2014-06-17 00:00:00' AND '2014-06-17 23:59:59'
GROUP BY dummy.time
http://www.sqlfiddle.com/#!2/c7adb/2
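To illustrate why the conditions belong in the join rather than in the WHERE clause, here is a minimal sketch on an in-memory SQLite database with made-up orders. It is written as a LEFT JOIN starting from dummy_time, which is the mirror image of the RIGHT JOIN above, and it uses SQLite's strftime('%H', ...) where MySQL would use hour().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dummy_time (time INTEGER)")
conn.executemany("INSERT INTO dummy_time VALUES (?)", [(h,) for h in range(24)])
conn.execute("CREATE TABLE orders (price INTEGER, time TEXT, state INTEGER)")
# Hypothetical orders: two valid ones at 09:xx, one with state 0,
# one on a different day, and one valid order at 23:59:59.
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (10, "2014-06-17 09:15:00", 1),
    (5,  "2014-06-17 09:45:00", 1),
    (7,  "2014-06-17 13:00:00", 0),
    (20, "2014-06-18 09:00:00", 1),
    (3,  "2014-06-17 23:59:59", 1),
])

# All filters on orders live in the ON clause, so hours without a
# matching order still produce a row (with NULL, turned into 0).
HOURLY_SQL = """
SELECT d.time AS hour, COALESCE(SUM(o.price), 0) AS total
FROM dummy_time d
LEFT JOIN orders o
  ON  CAST(strftime('%H', o.time) AS INTEGER) = d.time
  AND o.state = 1
  AND o.time BETWEEN '2014-06-17 00:00:00' AND '2014-06-17 23:59:59'
GROUP BY d.time
ORDER BY d.time
"""
totals = dict(conn.execute(HOURLY_SQL).fetchall())
print(totals)
```

If the same filters were placed in WHERE instead, the rows where orders is NULL would be discarded and the empty hours would disappear again.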
The query plan for the query below looks worse than the one above but because you reported that the JOIN seems to be the main source of slowness it's worth a try. The query below reduces the set of rows before doing a JOIN.
SELECT
COALESCE(t1.orders_sum,0),
t2.time
FROM
(
SELECT
sum(orders.price) orders_sum,
hour(orders.time) orders_hour
FROM orders
WHERE orders.state = 1
AND orders.time BETWEEN '2014-06-17 00:00:00' AND '2014-06-17 23:59:59'
GROUP BY hour(orders.time)
) t1 RIGHT JOIN dummy_time t2 ON t1.orders_hour = t2.time
http://www.sqlfiddle.com/#!2/0775b/1
Also, make sure your tables are indexed:
CREATE INDEX test_index1 ON orders (state,time);
CREATE INDEX test_index2 ON dummy_time (time);
Is this what you are looking for? It uses a case for when state=1 versus state=0 on whether to display sum orders or whether to display 0. If not please let me know your desired result.

Query joining in millions of records is slow, help me optimize please

Here's my query:
SELECT SQL_BUFFER_RESULT SQL_BIG_RESULT users.id, users.email,
COUNT(av.user_id) AS article_views_count,
COUNT(af.id) AS article_favorites_count,
COUNT(lc.user_id) AS link_clicks_count,
COUNT(ai.user_id) AS ad_impressions_count,
COUNT(ac.user_id) AS ad_clicks_count
FROM users
LEFT JOIN article_views AS av ON (av.user_id = users.id AND av.created_at >= '2012-11-28 00:00:00' AND av.created_at <= '2012-11-30 23:59:59')
LEFT JOIN article_favorites AS af ON (af.user_id = users.id AND af.created_at >= '2012-11-28 00:00:00' AND af.created_at <= '2012-11-30 23:59:59')
LEFT JOIN link_clicks AS lc ON (lc.user_id = users.id AND lc.created_at >= '2012-11-28 00:00:00' AND lc.created_at <= '2012-11-30 23:59:59')
LEFT JOIN ad_impressions AS ai ON (ai.user_id = users.id AND ai.created_at >= '2012-11-28 00:00:00' AND ai.created_at <= '2012-11-30 23:59:59')
LEFT JOIN ad_clicks AS ac ON (ac.user_id = users.id AND ac.created_at >= '2012-11-28 00:00:00' AND ac.created_at <= '2012-11-30 23:59:59')
GROUP BY users.id
HAVING (article_views_count + article_favorites_count + link_clicks_count + ad_impressions_count + ad_clicks_count) > 0
Some stats to give you context:
users: 1,474,348 rows
article_views: 32,603,637 rows
article_favorites: 10,199 rows
link_clicks: 4,258,901 rows
ad_impressions: 66,758,573 rows
ad_clicks: 324,125 rows
Every table that is joined in has a composite index on user_id and created_at (in that order).
We're running Mysql 5, every table is MyISAM engine.
Here's an EXPLAIN of the query: https://gist.github.com/4197482
The goal is to only return users that have any activity (view, favorite, click, impression, ad click) within the time period.
Any ideas to optimize this bad boy?
Your query seems to be an analytical query that analyzes a large amount of data (it contains aggregate functions and a GROUP BY clause).
To improve performance on such queries, you can create a materialized view of the result of the JOIN with something like:
CREATE TABLE my_view AS SELECT ... FROM ... JOIN ...
By doing that, the next query will be much more efficient, as MySQL will only have to calculate the aggregation.
You will then just have to implement a strategy to refresh the table (via a timestamp, for example).
Another solution is to import your data into a DBMS that is built to be efficient on this kind of query: a column-oriented database. For example, InfiniDB, which is an open source DBMS based on MySQL with a storage engine optimized for analytical queries.
Try splitting the query into an INNER JOIN with each table and combining them with UNION.
Like:
SELECT users.id, users.email, COUNT(av.user_id) AS article_views_count
FROM users
JOIN article_views AS av ON (av.user_id = users.id AND av.created_at >= '2012-11-28 00:00:00' AND av.created_at <= '2012-11-30 23:59:59')
GROUP BY users.id, users.email
UNION
....
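A possible way to finish this idea, as a sketch only: aggregate each activity table on its own, stack the per-table results with UNION ALL, and sum them per user. The demo uses two of the five tables, made-up rows, and an in-memory SQLite database; the column aliases and the `activity` alias are my own naming. It also shows a side effect of the original multi-LEFT JOIN form: every view row gets paired with every click row for the same user, so the counts are multiplied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users         (id INTEGER PRIMARY KEY, email TEXT);
CREATE TABLE article_views (user_id INTEGER, created_at TEXT);
CREATE TABLE link_clicks   (user_id INTEGER, created_at TEXT);
INSERT INTO users VALUES (1, 'a@example.com'), (2, 'b@example.com'),
                         (3, 'c@example.com');
-- Hypothetical activity: user 1 has 2 views and 3 clicks, user 2 has 1 view.
INSERT INTO article_views VALUES
  (1, '2012-11-28 10:00:00'), (1, '2012-11-29 10:00:00'),
  (2, '2012-11-30 10:00:00');
INSERT INTO link_clicks VALUES
  (1, '2012-11-28 11:00:00'), (1, '2012-11-29 11:00:00'),
  (1, '2012-11-30 11:00:00');
""")

# Original shape: joining both activity tables to users in one go.
JOINED_SQL = """
SELECT u.id, COUNT(av.user_id), COUNT(lc.user_id)
FROM users u
LEFT JOIN article_views av ON av.user_id = u.id
LEFT JOIN link_clicks   lc ON lc.user_id = u.id
GROUP BY u.id
HAVING COUNT(av.user_id) + COUNT(lc.user_id) > 0
ORDER BY u.id
"""

# Alternative: aggregate each table separately, then combine.
UNION_SQL = """
SELECT activity.user_id,
       SUM(activity.views)  AS article_views_count,
       SUM(activity.clicks) AS link_clicks_count
FROM (
    SELECT user_id, COUNT(*) AS views, 0 AS clicks
    FROM article_views
    WHERE created_at BETWEEN '2012-11-28 00:00:00' AND '2012-11-30 23:59:59'
    GROUP BY user_id
    UNION ALL
    SELECT user_id, 0 AS views, COUNT(*) AS clicks
    FROM link_clicks
    WHERE created_at BETWEEN '2012-11-28 00:00:00' AND '2012-11-30 23:59:59'
    GROUP BY user_id
) AS activity
GROUP BY activity.user_id
ORDER BY activity.user_id
"""
inflated = conn.execute(JOINED_SQL).fetchall()
per_table = conn.execute(UNION_SQL).fetchall()
print(inflated)   # user 1 shows 2 * 3 = 6 for both counts
print(per_table)  # user 1 shows 2 views and 3 clicks
```

Users without any activity never appear in the stacked result, so no HAVING clause is needed; joining the result back to users would supply the email column.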