Multiple LEFT JOINs to self with criteria to produce distribution - mysql

Although several questions come close to what I want (and as I write this, Stack Overflow has suggested several more, none of which quite captures my problem), I just don't seem to be able to find my way out of the SQL thicket.
I have a single table (let's call it the user_classification_fct) that has three fields: user, week, and class (e.g. user #1 in week #1 had a class of 'Regular User', while user #2 in week #1 has a class of 'Infrequent User'). (As an aside, I have implemented classes as INTs, but wanted to work with something legible in the form of VARCHAR while I sorted out the SQL.)
What I want to do is produce a summary report of how user behaviour is changing in aggregate along the lines of:
There were 50 users who were regular users in both week 1 and week 2 and ...
There were 10 users who were regular users in week 1, but fell to infrequent users in week 2
There were 5 users who went from infrequent in week 1 to regular in week 2
... and so on ...
What makes this slightly more tricky is that user #5000 might only have started using the service in week 2 and so have no record in the table for week 1. In that case, I'd want to see a NULL for week 1 and a 'Regular User' (or whatever is appropriate) for week 2. The size of the table is not strictly relevant, but with 5 weeks' worth of data I'm looking at 42 million rows, so I do not want to insert 4 'fake' rows of 'Non-User' for someone who only starts using the service in week 5 or something.
To me this seems rather obviously like a case for using a LEFT or RIGHT JOIN in MySQL because the NULL should come through on the 'missing' record.
I have tried using both WHERE and AND conditions on the LEFT JOINs and am just not getting the 'right' answers: I either get no NULL values at all (in the case of trailing WHERE conditions), or my counts are far, far too high for the number of distinct users, which is ca. 10 million (in the case of the AND constraints used below). Here was my last attempt to get this working:
SELECT
ucf1.class_nm AS 'Class in 2012/15',
ucf2.class_nm AS 'Class in 2012/16',
ucf3.class_nm AS 'Class in 2012/17',
ucf4.class_nm AS 'Class in 2012/18',
ucf5.class_nm AS 'Class in 2012/19',
count(*) AS 'Count'
FROM
user_classification_fct ucf5
LEFT JOIN user_classification_fct ucf4
ON ucf5.user_id=ucf4.user_id
AND ucf5.week_key=201219 AND ucf4.week_key=201218
LEFT JOIN user_classification_fct ucf3
ON ucf4.user_id=ucf3.user_id
AND ucf4.week_key=201218 AND ucf3.week_key=201217
LEFT JOIN user_classification_fct ucf2
ON ucf3.user_id=ucf2.user_id
AND ucf3.week_key=201217 AND ucf2.week_key=201216
LEFT JOIN user_classification_fct ucf1
ON ucf2.user_id=ucf1.user_id
AND ucf2.week_key=201216 AND ucf1.week_key=201215
GROUP BY 1,2,3,4,5;
In looking at the various other questions on stackoverflow.com, it may well be that I need to perform the queries one-at-a-time and UNION the result sets together or use parentheses to chain them one-to-another, but those approaches are not ones that I'm familiar with (yet) and I can't even get a single LEFT JOIN (i.e. week 5 to week 1, dropping all the other weeks of data) to return something useful.
Any tips would be much, much appreciated and I would really appreciate suggestions that work in MySQL as switching database products is not an option.

You can do this with a group by. I would start by summarizing all the possible combinations for the five weeks as:
select c_201215, c_201216, c_201217, c_201218, c_201219,
count(*) as cnt
from (select user_id,
max(case when week_key=201215 then class_nm end) as c_201215,
max(case when week_key=201216 then class_nm end) as c_201216,
max(case when week_key=201217 then class_nm end) as c_201217,
max(case when week_key=201218 then class_nm end) as c_201218,
max(case when week_key=201219 then class_nm end) as c_201219
from user_classification_fct ucf
group by user_id
) t
group by c_201215, c_201216, c_201217, c_201218, c_201219
This may solve your problem. If you have 5 classes (including NULL), then this will return at most 5^5 or 3,125 rows.
This fits into Excel, so you can do the final processing there. Alternatively, you can still use the database.
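To see the pivot on a small scale, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made only so the example runs anywhere), and the sample rows and shortened class names are made up. The inner query is the same conditional-aggregation pattern as above, cut down to two weeks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_classification_fct (user_id INT, week_key INT, class_nm TEXT)"
)
# Hypothetical sample data; user 4 has no row for week 201215.
conn.executemany(
    "INSERT INTO user_classification_fct VALUES (?, ?, ?)",
    [
        (1, 201215, "Regular"), (1, 201216, "Regular"),
        (2, 201215, "Regular"), (2, 201216, "Infrequent"),
        (3, 201215, "Regular"), (3, 201216, "Regular"),
        (4, 201216, "Regular"),
    ],
)

pivot_sql = """
SELECT c_201215, c_201216, COUNT(*) AS cnt
FROM (SELECT user_id,
             -- CASE without ELSE yields NULL, so a missing week stays NULL
             MAX(CASE WHEN week_key = 201215 THEN class_nm END) AS c_201215,
             MAX(CASE WHEN week_key = 201216 THEN class_nm END) AS c_201216
      FROM user_classification_fct
      GROUP BY user_id) t
GROUP BY c_201215, c_201216
"""
rows = conn.execute(pivot_sql).fetchall()
distribution = {(w1, w2): cnt for w1, w2, cnt in rows}
for combo, cnt in sorted(distribution.items(), key=lambda kv: -kv[1]):
    print(combo, cnt)
```

Each user lands in exactly one combination, so the counts add up to the number of distinct users, and the user with no week-1 row shows up with None (NULL) in the first column.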
If you want to extract pairs of weeks, then I would suggest putting the above into a temporary table, say "t", and doing a series of extracts with unions:
select *
from ((select '201215' as weekstart, c_201215, c_201216, sum(cnt) as cnt
from t
group by c_201215, c_201216
) union all
(select '201216', c_201216, c_201217, sum(cnt) as cnt
from t
group by c_201216, c_201217
) union all
(select '201217', c_201217, c_201218, sum(cnt) as cnt
from t
group by c_201217, c_201218
) union all
(select '201218', c_201218, c_201219, sum(cnt) as cnt
from t
group by c_201218, c_201219
)
) tg
order by 1, cnt desc
I suggest putting it in a subquery because you don't want to mess around with common-subquery optimizations on such a large table. You'll get to your final answer by summarizing first, and then bringing the data together.
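As a rough illustration of the week-pair extracts, the sketch below builds a tiny summary table t by hand and runs two of the UNION ALL branches against it. Again this is Python with sqlite3 standing in for MySQL; the counts, the three-week layout, and the class_from / class_to aliases are assumptions for the example only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# t plays the role of the per-combination summary (hypothetical counts).
conn.execute(
    "CREATE TABLE t (c_201215 TEXT, c_201216 TEXT, c_201217 TEXT, cnt INT)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        ("Regular", "Regular", "Regular", 50),
        ("Regular", "Infrequent", "Regular", 10),
        (None, "Regular", "Regular", 5),
        ("Infrequent", "Regular", "Infrequent", 7),
    ],
)

pairs_sql = """
SELECT '201215' AS weekstart, c_201215 AS class_from, c_201216 AS class_to,
       SUM(cnt) AS cnt
FROM t GROUP BY c_201215, c_201216
UNION ALL
SELECT '201216', c_201216, c_201217, SUM(cnt)
FROM t GROUP BY c_201216, c_201217
ORDER BY 1, 4 DESC
"""
transitions = conn.execute(pairs_sql).fetchall()
for row in transitions:
    print(row)
```

Because every branch re-aggregates the small summary rather than the 42-million-row fact table, each pair of weeks is cheap to compute. One thing to check on the MySQL side: as far as I know, a TEMPORARY table cannot be referenced more than once in the same query there, so t may need to be an ordinary table for the union to run.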

Related

I would like to know if there is a better way to write this query (multiple joins of the same table)

here is the problem:
I have vehicles table in db (fields of this table are not so important), what's important is that each vehicle has a model_id, which refers to the vehicle_models table.
Vehicle models table has id, class, model, series, cm3hp, created_at and updated_at fields.
I need to define the stock age in terms of how many vehicles of the certain model class are on the stock by the given criteria. The criteria being: 0-30 days, 31-60 days, 61-90 days... 360 + days...
I don't know if it is clear enough, but let me try to explain further: for each day range I need to find the count of vehicles with the given model class. There are other criteria, but they are not important for what I am trying to find out. To help you better understand the problem, I'll include a screenshot of how the structure should look:
I am using MySQL 8.
The query I wrote is:
SELECT DISTINCT vm.class,
IFNULL(t1.count, 0) as t1c,
IFNULL(t2.count, 0) as t2c,
IFNULL(t3.count, 0) as t3c,
IFNULL(t4.count, 0) as t4c,
IFNULL(t5.count, 0) as t5c,
IFNULL(t6.count, 0) as t6c,
IFNULL(t7.count, 0) as t7c
FROM vehicle_models vm
LEFT JOIN (
SELECT
vm.class as class,
count(*) as count
FROM a3s186jg7ffmm0q8.vehicles v
JOIN vehicle_models vm
ON vm.id = v.model_id
WHERE
DATEDIFF(IFNULL(v.retail_date, now()), v.wholesale_date) BETWEEN 0 AND 30
GROUP BY vm.class
) t1 ON t1.class = vm.class
*** MORE SAME LEFT JOINS ***
ORDER BY vm.class;
Now, this provides the desired results, but I would like to know whether there is a better way to write this query in terms of performance and code structure.
I guess you are presenting a report of inventory aging (how long a car sits on the dealer's lot before somebody buys it). You can put the age ranges in your top-level SELECT rather than putting each one in a separate subquery. That will make your query shorter and easier to read, and likely faster, since each subquery repeats the join between vehicles and vehicle_models.
Try something like this nested query. The inner query gives back one row per vehicle with its aging number. The outer query aggregates them.
SELECT class,
COUNT(*) total,
SUM(age BETWEEN 0 AND 30) t1c,
SUM(age BETWEEN 31 AND 60) t2c,
SUM(age BETWEEN 61 AND 90) t3c,
... etc ...
FROM (
SELECT vm.class,
DATEDIFF(IFNULL(v.retail_date, now()), v.wholesale_date) age
FROM a3s186jg7ffmm0q8.vehicles v
JOIN vehicle_models vm ON vm.id = v.model_id
) subq
GROUP BY class
ORDER BY class;
This SUM() trick works in MySQL because expressions like age BETWEEN 0 AND 30 have the value 1 when true and 0 when false.
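Here is a small runnable sketch of the same idea, using Python's sqlite3 module in place of MySQL (an assumption for portability; SQLite also evaluates BETWEEN to 1 or 0). SQLite has no DATEDIFF, so the age is computed with julianday instead, and a fixed :today parameter replaces now() to keep the output repeatable. The rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vehicle_models (id INT, class TEXT);
CREATE TABLE vehicles (model_id INT, wholesale_date TEXT, retail_date TEXT);
INSERT INTO vehicle_models VALUES (1, 'A'), (2, 'B');
INSERT INTO vehicles VALUES
  (1, '2024-01-01', '2024-01-11'),  -- sold after 10 days
  (1, '2024-01-01', '2024-02-15'),  -- sold after 45 days
  (1, '2024-01-01', NULL),          -- still in stock
  (2, '2024-03-01', NULL),          -- still in stock
  (2, '2024-02-01', '2024-02-20');  -- sold after 19 days
""")

aging_sql = """
SELECT class,
       COUNT(*) AS total,
       SUM(age BETWEEN 0 AND 30)  AS t1c,
       SUM(age BETWEEN 31 AND 60) AS t2c,
       SUM(age BETWEEN 61 AND 90) AS t3c
FROM (SELECT vm.class,
             -- julianday difference stands in for MySQL's DATEDIFF
             CAST(julianday(IFNULL(v.retail_date, :today))
                  - julianday(v.wholesale_date) AS INTEGER) AS age
      FROM vehicles v
      JOIN vehicle_models vm ON vm.id = v.model_id) subq
GROUP BY class
ORDER BY class
"""
report = conn.execute(aging_sql, {"today": "2024-03-21"}).fetchall()
for row in report:
    print(row)
```

Every vehicle is read once and falls into exactly one bucket, so the bucket columns sum to total for each class.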

MySQL Left Join throwing off my count numbers

I'm doing a left join on a table to get the number of leads we've generated today and how many times we've called those leads. I figured a left join would be the best thing to do, so I wrote the following query:
SELECT
COUNT(rad.phone_number) as lead_number, rals.lead_source_name as source, COUNT(racl.phone_number) as calls, SUM(case when racl.contacted = 1 then 1 else 0 end) as contacted
FROM reporting_app_data rad
LEFT JOIN reporting_app_call_logs racl ON rad.phone_number = racl.phone_number, reporting_app_lead_sources rals
WHERE DATE(rad.created_at) = CURDATE() AND rals.list_id = rad.lead_source
GROUP BY rad.lead_source;
But the problem with that, is that if in the reporting_app_call_logs table, there are multiple entries for the same phone number (so a phone number has been called multiple times that day), the lead_number (which I want to count how many leads were generated on the current day grouped by lead_source) equals how many calls there are. So the count from the LEFT table equals the count from the RIGHT table.
How do I write a SQL query that gets the number of leads and the total number of calls per lead source?
Try COUNT(DISTINCT expression)
In other words, change COUNT(rad.phone_number) to COUNT(DISTINCT rad.phone_number)
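The difference is easy to see on a few rows. The sketch below uses Python's sqlite3 module in place of MySQL and leaves out the lead-source lookup and the date filter to keep it short; the data is made up, and it assumes each lead has its own phone number within a source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reporting_app_data (phone_number TEXT, lead_source INT);
CREATE TABLE reporting_app_call_logs (phone_number TEXT, contacted INT);
INSERT INTO reporting_app_data VALUES ('111', 1), ('222', 1), ('333', 2);
-- lead 111 was called three times, lead 222 once, lead 333 never
INSERT INTO reporting_app_call_logs VALUES
  ('111', 0), ('111', 1), ('111', 0), ('222', 1);
""")

count_sql = """
SELECT rad.lead_source,
       COUNT(rad.phone_number)          AS naive_leads,  -- one per joined row
       COUNT(DISTINCT rad.phone_number) AS leads,        -- one per lead
       COUNT(racl.phone_number)         AS calls,
       SUM(CASE WHEN racl.contacted = 1 THEN 1 ELSE 0 END) AS contacted
FROM reporting_app_data rad
LEFT JOIN reporting_app_call_logs racl
       ON rad.phone_number = racl.phone_number
GROUP BY rad.lead_source
ORDER BY rad.lead_source
"""
per_source = conn.execute(count_sql).fetchall()
for row in per_source:
    print(row)
```

The LEFT JOIN repeats a lead row once per matching call, which is why the plain COUNT tracks the call count, while COUNT(DISTINCT ...) collapses the repeats again.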

Build sql query from multiple tables for cyfe dashboard

To monitor data from my users, I want to visualise it as a cohort analysis. Let's say that I have the following tables in my database:
Table: track_register
user_id, date, time
And in the following table:
Table: track_login
user_id, date, time, succes
This is how I want my cohort analysis to look:
Months   Sign Ups   Logged in more than once
May      40         80%
I am using Cyfe to visualise this so the data has to be formatted in a table like this:
Month,Sign Ups,Loged in more then once
May 2015,40,32
Jun 2015,60,55
Eventually I want to add more data to the cohort from other tables, such as the percentage of users who actually bought the product and more of that good stuff.
The first set of data (the signups per month) is not the hard part. What I am struggling with is how to fetch the data from the track_login table. I will have to count the number of times a specific user has logged in, and if that's > 1 then +1. I can imagine that you use CASE for that. The trouble is to separate it by the correct month, because the month where the +1 is supposed to go needs to be fetched from the track_register table.
It seems kind of hard to me to put this all in one single query. But if it couldn't be done, why go to the trouble of building a cohort analysis on Cyfe?
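Hi. DATE is restricted as a field name, so I used DATA instead. Note that the queries below use Oracle-style functions (TO_CHAR, NVL); in MySQL the closest equivalents are DATE_FORMAT and IFNULL.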
You can try this code:
SELECT TO_CHAR(NVL(a.data, b.data), 'MON YYYY') months
, COUNT(DISTINCT a.login) sign_ups
, SUM(CASE WHEN COUNT(DISTINCT b.login) > 1 THEN 1 ELSE 0 END) Loged_in_more_then_once
FROM track_register a LEFT JOIN track_login b ON a.login = b.login
GROUP BY TO_CHAR(NVL(a.data, b.data), 'MON YYYY')
ORDER BY 1
Or:
SELECT TO_CHAR(NVL(a.data, b.data), 'MON YYYY') months
, COUNT(DISTINCT a.login) sign_ups
, SUM(CASE WHEN COUNT(DISTINCT b.login) > 1 THEN 1 ELSE 0 END) Loged_in_more_then_once
FROM track_register a LEFT JOIN track_login b
ON a.login = b.login AND LAST_DAY(a.data) = LAST_DAY(b.data)
GROUP BY TO_CHAR(NVL(a.data, b.data), 'MON YYYY')
ORDER BY 1
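For comparison, here is a different approach that avoids nesting one aggregate inside another: count the logins per user in a derived table first, then LEFT JOIN that to the registrations and group by signup month. This is only a sketch under several assumptions: it runs on Python's sqlite3 rather than MySQL, the date columns are renamed to the hypothetical reg_date and login_date, each user has exactly one track_register row, and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE track_register (user_id INT, reg_date TEXT);
CREATE TABLE track_login (user_id INT, login_date TEXT);
INSERT INTO track_register VALUES
  (1, '2015-05-03'), (2, '2015-05-10'), (3, '2015-06-01'), (4, '2015-06-15');
-- user 1 logs in three times, user 2 once, user 3 twice, user 4 never
INSERT INTO track_login VALUES
  (1, '2015-05-03'), (1, '2015-05-20'), (1, '2015-06-02'),
  (2, '2015-05-10'),
  (3, '2015-06-01'), (3, '2015-06-05');
""")

cohort_sql = """
SELECT strftime('%Y-%m', r.reg_date) AS month,
       COUNT(*) AS sign_ups,
       SUM(CASE WHEN l.logins > 1 THEN 1 ELSE 0 END) AS logged_in_more_than_once
FROM track_register r
LEFT JOIN (SELECT user_id, COUNT(*) AS logins
           FROM track_login
           GROUP BY user_id) l ON l.user_id = r.user_id
GROUP BY month
ORDER BY month
"""
cohort = conn.execute(cohort_sql).fetchall()
for month, sign_ups, repeat_users in cohort:
    print(f"{month},{sign_ups},{repeat_users}")
```

The month always comes from track_register, so a repeat login in a later month is still credited to the signup cohort. In MySQL, DATE_FORMAT(r.reg_date, '%Y-%m') would take the place of strftime.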

Mysql SUM CASE with unique IDs only

Easiest explained through an example.
A father has children who win races.
How many of a father's offspring have won a race, and how many races in total have a father's offspring won? (winners and wins)
I can easily figure out the total number of wins, but sometimes a child wins more than one race, so to figure out winners I need to count each child only once if it has won, not every time it has won.
In the below extract from a query I cannot use Distinct, so this doesn't work
SUM(CASE WHEN r.finish = '1' AND DISTINCT h.runnersid THEN 1 ELSE 0 END ) AS winners,
This also won't work
SUM(SELECT DISTINCT r.runnersid FROM runs r WHERE r.finish='1') AS winners
This works when I need to find the total amount of wins.
SUM(CASE WHEN r.finish = '1' THEN 1 ELSE 0 END ) AS wins,
Here is a sqlfiddle http://sqlfiddle.com/#!2/e9a81/1
Let's take this step by step.
You have two pieces of information you are looking for: who has won a race, and how many races they have won.
Taking the first one, you can select a distinct runnersid where they have a first place finish:
SELECT DISTINCT runnersid
FROM runs
WHERE finish = 1;
For the second one, you can select every runnersid where they have a first place finish, count the number of rows returned, and group by runnersid to get the total wins for each:
SELECT runnersid, COUNT(*) AS numWins
FROM runs
WHERE finish = 1
GROUP BY runnersid;
The second one actually has everything you want. You don't need to do anything with that first query, but I used it to help demonstrate the thought process I take when trying to accomplish a task like this.
Here is the SQL Fiddle example.
EDIT
As you've seen, you don't really need the SUM here. Because finish represents a place in the race, you don't want to SUM that value, but you want to COUNT the number of wins.
EDIT2
An additional edit based on OP's requirements. The above does not match what OP needs, but I left it in as a reference for any future readers. What OP really needs, as I understand it now, is the number of children each father has that have won a race. I will again explain my thought process step by step.
First I wrote a simple query that pulls all of the winning father-son pairs. I was able to use GROUP BY to get the distinct winning pairs:
SELECT father, name
FROM runs
WHERE finish = 1
GROUP BY father, name;
Once I had done that, I used it as a subquery with the COUNT(*) function to get the number of winners for each father (this means I have to group by father):
SELECT father, COUNT(*) AS numWinningChildren
FROM(SELECT father, name
FROM runs
WHERE finish = 1
GROUP BY father, name) t
GROUP BY father;
If you just need the fathers with winning children, you are done. If you want to see all fathers, I would write one query to select all fathers, join it with our result set above, and replace any values where numWinningChildren is null, with 0.
I'll leave that part to you to challenge yourself a bit. Also because SQL Fiddle is down at the moment and I can't test what I was thinking, but I was able to test those above with success.
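One alternative to the join described above, which is not necessarily what the answer had in mind, is to put the DISTINCT inside COUNT around a CASE expression. That returns winners and wins in a single pass and keeps fathers with no winning children, as long as they have at least one row in runs. The sketch uses Python's sqlite3 module in place of MySQL, and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (father TEXT, name TEXT, finish INT)")
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?)",
    [
        ("jack", "ann", 1), ("jack", "ann", 1),  # ann wins twice
        ("jack", "bob", 2),
        ("jack", "cat", 1),
        ("tom", "dan", 3), ("tom", "eve", 2),    # tom has no winners
    ],
)

stats_sql = """
SELECT father,
       -- CASE yields NULL for non-wins, and COUNT(DISTINCT ...) ignores NULLs
       COUNT(DISTINCT CASE WHEN finish = 1 THEN name END) AS winners,
       SUM(CASE WHEN finish = 1 THEN 1 ELSE 0 END)        AS wins
FROM runs
GROUP BY father
ORDER BY father
"""
stats = conn.execute(stats_sql).fetchall()
for row in stats:
    print(row)
```

This assumes a child's name is unique within a father; if it is not, the runner's id column would be the safer thing to count.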
I think you want the father name along with the count of the wins by his sons.
select father, count(distinct(id)) wins
from runs where father = 'jack' and finish = 1
group by father
I am not sure if this is what you are looking for
Select user_id, sum(case when finish='1' then 1 else 0 end) as total
From table
Group by user_id

How to form a subquery

I think I need a subquery for this, and while I have read what subqueries are, I have not found help on how to write a subquery. I am interested in learning how to fish, but I also would like a fish soon, please :)
Simple, 1 table of data:
lastname, (found or not found boolean)
I want to generate some stats, across the whole alphabet, of who has been found.
Desired results:
A : 5 of 16 found, or about 31 percent
B : 2 of 4 found, or about 50 percent
C : 30 of 90 found, or about 30 percent
etc
I can form simple SQL, I need help with forming the subquery, if that's what is needed here.
I can write a query to list how many were found by the first letter of the last name:
select substring(lastname,1,1) as lastinitial, count(*) from members where found !=0 and found is not null group by lastinitial;
I can write a query to list how many total there are, by last initial:
select substring(lastname,1,1) as lastinitial, count(*) from members group by lastinitial;
But how do I combine the two queries to yield the desired result? Thanks for the help.
You probably don't need a subquery for this. Grouping can give you both found and not found for each initial. Just add "found" to the grouping and you will get two records for each initial, one for found and another for not found. You also don't need another query for the total; just add the found and not found counts together.
SELECT SUBSTRING(lastname,1,1) AS lastinitial,
(CASE WHEN found = 1 THEN 1 ELSE 0 END) AS found_val,
COUNT(lastname) AS found_count
FROM members
GROUP BY lastinitial, found_val;
If you want to have both of the found and not found in one row for each letter, try this:
SELECT found_list.lastinitial, found_count, not_found_count
FROM (
SELECT SUBSTRING(lastname,1,1) AS lastinitial, COUNT(lastname) AS found_count
FROM members
WHERE found = 1
GROUP BY lastinitial
) AS found_list,
(
SELECT SUBSTRING(lastname,1,1) AS lastinitial, COUNT(lastname) AS not_found_count
FROM members
WHERE found IS NULL OR found = 0
GROUP BY lastinitial
) AS not_found_list
WHERE found_list.lastinitial = not_found_list.lastinitial
As you can see, the first query is much shorter, more elegant, and also performs faster.
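If the goal is one row per letter with both the found count and the total, a conditional SUM next to COUNT(*) is another option that needs neither a join nor an extra grouping column. Below is a minimal sketch using Python's sqlite3 module as a stand-in for MySQL, with invented names; SUBSTR is used because, as far as I know, both SQLite and MySQL accept it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (lastname TEXT, found INT)")
conn.executemany(
    "INSERT INTO members VALUES (?, ?)",
    [
        ("Adams", 1), ("Allen", 0), ("Avery", 1),
        ("Baker", None), ("Brown", 1),  # NULL counts as not found
    ],
)

initials_sql = """
SELECT SUBSTR(lastname, 1, 1) AS lastinitial,
       SUM(CASE WHEN found = 1 THEN 1 ELSE 0 END) AS found_count,
       COUNT(*) AS total
FROM members
GROUP BY lastinitial
ORDER BY lastinitial
"""
lines = [
    f"{initial} : {found} of {total} found, or about {round(100 * found / total)} percent"
    for initial, found, total in conn.execute(initials_sql)
]
print("\n".join(lines))
```

Letters where nobody has been found still appear, with a found count of 0, which the join of two filtered subqueries would drop.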