Calculating an "activation" percentage using SQL - mysql

So I have the following SQL schema (http://sqlfiddle.com/#!2/b366c) and what I'm trying to achieve is the % of companies that I can consider activated.
In the schema, you can see there are the following tables
organisations (otherwise known as companies)
competitions
competitionmembers
activity_entries
What I would like to do is, figure out the % of companies in the organisations (i.e. total users) that create a competition (competitions table), invite at least another person (competitionmembers table) and have completed at least one activity (activity_entries table)
This may be too complex, but what I'd also like to do is create a funnel, to visualise where most companies drop off. For this, I understand I should create a separate query for each of the steps and then just stack them to see the flow.
Using the sample data provided here (http://sqlfiddle.com/#!2/b366c) you can see that:
1. 4 companies have registered
2. 2 companies have created competitions
3. 1 company has a competition with at least 2 participants (not just the admin)
4. 1 company has registered at least one activity
So 25% of companies became "activated".
I would really appreciate some help in building these queries and visualising percentages!

Maybe not the most efficient way, but the intermediate results ought to be small enough for this not to matter overmuch.
You can run the inner queries on their own to look at the different results:
SELECT COUNT(oid) AS organizations,
SUM(IF(competitions > 0, 1, 0)) AS competing,
SUM(IF(activations > 0, 1, 0)) AS activated,
100.0*SUM(IF(activations > 0, 1, 0))/COUNT(oid) AS actpercent
FROM (
SELECT oid,
SUM(IF(cid IS NULL,0,1)) AS competitions,
SUM(IF(aid IS NULL,0,1)) AS activations
FROM (
SELECT
o.organisationId AS oid,
c.competitionId AS cid,
a.id AS aid
FROM organisations AS o
LEFT JOIN competitions c USING (organisationId)
LEFT JOIN activity_entries AS a USING (competitionId)
) AS situation GROUP BY oid
) AS summary;
First we get the situation list of all organizations, competitions and activities; here you may add a WHERE condition to filter organizations of interest, removed competitions and so on.
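From this we get a summary of organizations with the number of competitions and activations for each. Note that these are counts of joined rows rather than of distinct competitions: if an organization has three competitions, one with three activities and two with none, the inner query yields five rows, so competitions is five and activations is three. That does not matter here, because the outer query only tests whether each count is greater than zero; use COUNT(DISTINCT cid) if you need the exact number of competitions.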
Then we just get the total count of organizations, and calculate the number of activations as a percentage.
The output of the above would be,
ORGANIZATIONS COMPETING ACTIVATED ACTPERCENT
4 2 1 25
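To try the nested query without a MySQL server, here is a minimal sketch that runs the same logic against an in-memory SQLite database from Python. SQLite stands in for MySQL, so IF(...) is written as CASE and SUM(IF(x IS NULL,0,1)) as the equivalent COUNT(x). The table layout and sample rows are assumptions invented to mirror the numbers in the question, not the contents of the SQL Fiddle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organisations (organisationId INTEGER PRIMARY KEY);
CREATE TABLE competitions (competitionId INTEGER PRIMARY KEY, organisationId INTEGER);
CREATE TABLE activity_entries (id INTEGER PRIMARY KEY, competitionId INTEGER);
-- invented sample data: 4 organisations, 2 with competitions, 1 with activity
INSERT INTO organisations VALUES (1), (2), (3), (4);
INSERT INTO competitions VALUES (10, 1), (11, 1), (20, 2);
INSERT INTO activity_entries VALUES (100, 10), (101, 10), (102, 10);
""")

query = """
SELECT COUNT(oid) AS organizations,
       SUM(CASE WHEN competitions > 0 THEN 1 ELSE 0 END) AS competing,
       SUM(CASE WHEN activations > 0 THEN 1 ELSE 0 END) AS activated,
       100.0 * SUM(CASE WHEN activations > 0 THEN 1 ELSE 0 END) / COUNT(oid) AS actpercent
FROM (
    -- summary: one row per organisation
    SELECT oid,
           COUNT(cid) AS competitions,   -- COUNT(col) skips NULLs
           COUNT(aid) AS activations
    FROM (
        -- situation: one row per organisation/competition/activity combination
        SELECT o.organisationId AS oid, c.competitionId AS cid, a.id AS aid
        FROM organisations AS o
        LEFT JOIN competitions AS c ON c.organisationId = o.organisationId
        LEFT JOIN activity_entries AS a ON a.competitionId = c.competitionId
    ) AS situation
    GROUP BY oid
) AS summary
"""

result = conn.execute(query).fetchone()
print(result)  # (4, 2, 1, 25.0)
```

Running the two inner SELECTs on their own in the same session is an easy way to inspect the intermediate rows.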
Addition
lserni would it be possible to add one more layer to your query, which
is the "inviting" aspect. i.e. if there are more than 2 users in the
competitionmembers table for a competition?
In this case for each competition we need to know how many members there are in another table. So we have to act on the query where the competitionId is available, and we modify situation:
SELECT
o.organisationId AS oid,
c.competitionId AS cid,
a.id AS aid
FROM organisations AS o
LEFT JOIN competitions c USING (organisationId)
LEFT JOIN activity_entries AS a USING (competitionId)
We just add the necessary GROUP BY existing-columns and the new aggregate field, and of course the necessary LEFT JOIN:
SELECT
o.organisationId AS oid,
c.competitionId AS cid,
a.id AS aid,
COUNT(m.id) AS members
FROM organisations AS o
LEFT JOIN competitions c USING (organisationId)
LEFT JOIN activity_entries AS a USING (competitionId)
LEFT JOIN competitionmembers AS m ON (c.competitionId = m.competitionid)
GROUP BY oid, cid, aid;
(which I think illustrates one of the advantages of nested "serialized" queries: they're easier to maintain. That at least is my opinion. Maybe the truth is just that I can't wrap my head around the more complicated, all-in-one queries...).
Now that we have members of competition, we look to the query immediately external to the one above:
SELECT oid,
SUM(IF(cid IS NULL,0,1)) AS competitions,
SUM(IF(aid IS NULL,0,1)) AS activations
FROM v_situation GROUP BY oid
(By the way: you can simplify the writing of these queries by offloading them to VIEWs. CREATE VIEW v_situation AS SELECT o.organisationId AS oid, ... GROUP BY oid, cid, aid; and you have a virtual table v_situation that you can use wherever you would a table.)
...and rewrite it adding the number of competitions with one member and those with more:
SELECT oid,
SUM(IF(cid IS NULL,0,1)) AS competitions,
SUM(IF(aid IS NULL,0,1)) AS activations,
SUM(IF(members > 1, 1, 0)) AS withmany,
SUM(IF(members = 1, 1, 0)) AS withone
FROM ( ... ) AS situation
GROUP BY oid;
Then you just need to decide what to do with that information. You can pass it through and re-select the withone field in the parent query, or you can calculate its percentage. In that case, remember that the number of activations may be zero, so you need to guard against the case when
activations_with_many_members / activations
has a zero in the denominator, using IF to change the formula to 0.0 if no activations are present:
IF(activations > 0, <percent formula>, 0.0 ) AS percent_with_many
Also, if you only wanted members wherever an activation is also present, you should do so in the definition of members, so that a member is counted only if its id is not null (we have a member) and the aid is not null (we have activation):
SUM(IF(a.id IS NOT NULL AND m.id IS NOT NULL,1, 0)) AS members
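For the funnel the question asks about, one possible sketch is to compute a few flags per organisation and then sum them. This is a different technique from the nested joins above: it counts members and activities with correlated subqueries, so the two child tables never multiply each other's rows. It is run here through Python's sqlite3 as a stand-in for MySQL (the WITH clause needs MySQL 8.0 or later). The sample rows, the "at least 2 members" threshold, and the reading of "activated" as "has a competition with at least 2 members and at least one activity" are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organisations (organisationId INTEGER PRIMARY KEY);
CREATE TABLE competitions (competitionId INTEGER PRIMARY KEY, organisationId INTEGER);
CREATE TABLE competitionmembers (id INTEGER PRIMARY KEY, competitionid INTEGER);
CREATE TABLE activity_entries (id INTEGER PRIMARY KEY, competitionId INTEGER);
-- invented sample data
INSERT INTO organisations VALUES (1), (2), (3), (4);
INSERT INTO competitions VALUES (10, 1), (20, 2);
INSERT INTO competitionmembers VALUES (1, 10), (2, 10), (3, 20);  -- 20 has only its admin
INSERT INTO activity_entries VALUES (100, 10), (101, 10);
""")

funnel_sql = """
WITH per_competition AS (
    SELECT c.organisationId AS oid,
           (SELECT COUNT(*) FROM competitionmembers m
             WHERE m.competitionid = c.competitionId) AS members,
           (SELECT COUNT(*) FROM activity_entries a
             WHERE a.competitionId = c.competitionId) AS activities
    FROM competitions c
),
per_org AS (
    SELECT o.organisationId AS oid,
           COUNT(p.oid) AS competitions,
           -- MAX over 0/1 comparisons: 1 if any competition qualifies
           COALESCE(MAX(p.members >= 2), 0) AS invited,
           COALESCE(MAX(p.members >= 2 AND p.activities > 0), 0) AS activated
    FROM organisations o
    LEFT JOIN per_competition p ON p.oid = o.organisationId
    GROUP BY o.organisationId
)
SELECT COUNT(*) AS registered,
       SUM(competitions > 0) AS created,
       SUM(invited) AS invited,
       SUM(activated) AS activated
FROM per_org
"""

funnel = conn.execute(funnel_sql).fetchone()
percent = [round(100.0 * n / funnel[0], 1) for n in funnel]
print(funnel)   # (4, 2, 1, 1)
print(percent)  # [100.0, 50.0, 25.0, 25.0]
```

Each number is a step of the funnel, and dividing by the first one gives the share of registered organisations that reached that step.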

select 1/ count(organisations.organisationId) * 100 *
(select count(distinct(org.organisationId)) from organisations org
inner join competitions cmp on org.organisationId = cmp.organisationId
inner join competitionmembers cmpm on cmpm.competitionid = cmp.competitionid
inner join activity_entries act on act.competitionid = cmpm.competitionid) as pct
from organisations

Related

Find count of rows in multiple tables based on a foreign key in a given table

I have a database that contains the following tables I am concerned with.
JobAreas (Base table for which I want to query other tables)
JobSkills (Every Job Skill belongs to a Job Area via foreign key i.e. parent_id)
Jobs (Every job must belong to a Job Area via foreign key i.e. category_id)
UserSkills (This table contains the JobSkill that is related to a Job Area)
I am attaching the table structures.
I am trying to create a SQL query that can give me the number of skills, number of jobs and number of people for various Job Areas. Though calculating Users who offer services in a particular Job Area appears to be tough because it is connected indirectly. I tried to get Number of Skills and Number of Jobs for all Job Areas using the following query.
select
t.id,
t.title,
count(s.parent_id) as skillsCount,
count(m.category_id) as jobCount
from
job_areas t
left join skill_types s ON s.parent_id = t.id
left join job_requests m ON m.category_id = t.id
group by
t.id
But it is not giving the correct data. Can someone point me in the right direction on how to achieve this?
You are joining along different dimensions, so each skill row is paired with each job row. The quick-and-dirty way to fix this is to use count(distinct) on a column that identifies each joined row (assuming here that both tables have a primary key column named id; counting distinct parent_id or category_id would return at most 1, since those always equal t.id):
select t.id, t.title,
count(distinct s.id) as skillsCount,
count(distinct m.id) as jobCount
from job_areas t left join
skill_types s
ON s.parent_id = t.id left join
job_requests m
ON m.category_id = t.id
group by t.id;
This works fine if there are just a handful of skills and jobs for each job area. If there are many, you want to pre-aggregate before the join or use correlated subqueries.
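The row multiplication and the two ways around it can be seen in a small sketch run through Python's sqlite3 as a stand-in for MySQL. The id primary key columns and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job_areas (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE skill_types (id INTEGER PRIMARY KEY, parent_id INTEGER);
CREATE TABLE job_requests (id INTEGER PRIMARY KEY, category_id INTEGER);
INSERT INTO job_areas VALUES (1, 'Plumbing'), (2, 'Gardening');
INSERT INTO skill_types VALUES (1, 1), (2, 1), (3, 1);  -- 3 skills in area 1
INSERT INTO job_requests VALUES (1, 1), (2, 1);         -- 2 jobs in area 1
""")

# Joining both child tables pairs every skill with every job (3 x 2 = 6 rows).
naive = conn.execute("""
SELECT t.id, COUNT(s.parent_id), COUNT(m.category_id)
FROM job_areas t
LEFT JOIN skill_types s ON s.parent_id = t.id
LEFT JOIN job_requests m ON m.category_id = t.id
GROUP BY t.id ORDER BY t.id
""").fetchall()

# Counting distinct row ids undoes the multiplication after the fact.
distinct = conn.execute("""
SELECT t.id, COUNT(DISTINCT s.id), COUNT(DISTINCT m.id)
FROM job_areas t
LEFT JOIN skill_types s ON s.parent_id = t.id
LEFT JOIN job_requests m ON m.category_id = t.id
GROUP BY t.id ORDER BY t.id
""").fetchall()

# Pre-aggregating each child table avoids the multiplication altogether.
preagg = conn.execute("""
SELECT t.id, COALESCE(s.cnt, 0), COALESCE(m.cnt, 0)
FROM job_areas t
LEFT JOIN (SELECT parent_id, COUNT(*) AS cnt
           FROM skill_types GROUP BY parent_id) s ON s.parent_id = t.id
LEFT JOIN (SELECT category_id, COUNT(*) AS cnt
           FROM job_requests GROUP BY category_id) m ON m.category_id = t.id
ORDER BY t.id
""").fetchall()

print(naive)     # [(1, 6, 6), (2, 0, 0)]
print(distinct)  # [(1, 3, 2), (2, 0, 0)]
print(preagg)    # [(1, 3, 2), (2, 0, 0)]
```

The pre-aggregated form keeps the intermediate result at one row per job area, which is why it tends to scale better when each area has many skills and jobs.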

MySQL LEFT JOIN Count column Join 3 tables

I have 3 tables which are part of my database.
debates (id 'PK', unit_id, starter_pack_id 'FK', title)
debate_stakeholders (id 'PK', starter_pack_id 'FK', name)
debate_groups (id 'PK', debate_id 'FK', student_id, stakeholder_id 'FK')
For this purpose all debates share the same stakeholders (4 stakeholders in total, all of these stakeholders are referenced for all debates).
My aim is to query all the debates and show debates.id, debates.title, debate_stakeholders.name, and a count of how many times each stakeholder occurs within that particular debate, including stakeholders whose count is 0. This part is important because, when I perform additional queries, I need to know which counts are greater than or equal to one, which are zero and which are null.
Here is the sample data of my database:
My expected outcome: (The count is just to show what It could look like)
I have attempted to create this MySQL query, but I am unable to achieve my exact requirements.
I have tried queries such as
SELECT
a.id,
a.name,
a.abbreviation,
d.id AS debateId,
IF(COUNT(b.stakeholder_id) = 0, 0, COUNT(b.stakeholder_id)) AS total_freq
FROM
debate_stakeholders a LEFT JOIN debate_groups b ON b.stakeholder_id = a.id
LEFT JOIN debates as d ON b.debate_id = d.id
GROUP BY
a.id, b.debate_id,d.id
HAVING
COUNT(*) < 3
ORDER BY a.id,d.id
LIMIT 1
But that hasn't quite panned out.
I must admit that the table names confuse me. A debate_stakeholder is not related to a debate. It's rather a stakeholder belonging to a starter pack, and there are also debates belonging to a starter pack. At least this is what I read from the table structures. Then a debate_group consists of a single student plus a stakeholder in a debate. It is strange to call this a group.
However, it seems you want to combine all stakeholders with all debates in a starter pack (i.e. get all combinations). Then you want to count how many students are related to each such debate / stakeholder combination. So write a query to count students per debate and stakeholder (an aggregation query grouped by debate and stakeholder) and use this as a subquery you outer-join to the debate / stakeholder combinations.
SELECT d.id,
d.title,
ds.name,
COALESCE(dg.students, 0) AS "count"
FROM debates d
JOIN debate_stakeholders ds
ON ds.starter_pack_id = d.starter_pack_id
LEFT JOIN
(
SELECT debate_id, stakeholder_id, COUNT(*) AS students
FROM debate_groups
GROUP BY debate_id, stakeholder_id
) dg
ON dg.debate_id = d.id AND
dg.stakeholder_id = ds.id;
Demo here:
SQLFiddle
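As a quick way to check the shape of the result, here is a minimal sketch of the same query run through Python's sqlite3 in place of MySQL. The column types and the sample rows are invented for illustration; only the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE debates (id INTEGER PRIMARY KEY, unit_id INTEGER,
                      starter_pack_id INTEGER, title TEXT);
CREATE TABLE debate_stakeholders (id INTEGER PRIMARY KEY,
                                  starter_pack_id INTEGER, name TEXT);
CREATE TABLE debate_groups (id INTEGER PRIMARY KEY, debate_id INTEGER,
                            student_id INTEGER, stakeholder_id INTEGER);
-- invented sample data: two debates and two stakeholders in one starter pack
INSERT INTO debates VALUES (1, 1, 1, 'Debate A'), (2, 1, 1, 'Debate B');
INSERT INTO debate_stakeholders VALUES (1, 1, 'Farmers'), (2, 1, 'Miners');
INSERT INTO debate_groups VALUES (1, 1, 501, 1), (2, 1, 502, 1), (3, 2, 503, 2);
""")

rows = conn.execute("""
SELECT d.id, d.title, ds.name, COALESCE(dg.students, 0) AS "count"
FROM debates d
JOIN debate_stakeholders ds
  ON ds.starter_pack_id = d.starter_pack_id      -- every debate/stakeholder pair
LEFT JOIN (
    SELECT debate_id, stakeholder_id, COUNT(*) AS students
    FROM debate_groups
    GROUP BY debate_id, stakeholder_id           -- students per pair
) dg
  ON dg.debate_id = d.id AND dg.stakeholder_id = ds.id
ORDER BY d.id, ds.id
""").fetchall()

for row in rows:
    print(row)
# (1, 'Debate A', 'Farmers', 2)
# (1, 'Debate A', 'Miners', 0)
# (2, 'Debate B', 'Farmers', 0)
# (2, 'Debate B', 'Miners', 1)
```

Pairs without any student still appear, with a count of 0, because the aggregated subquery is outer-joined to the full set of combinations.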
Try this, hope this will help:
SELECT d.id, COALESCE(d.title, 'NA') AS title, COALESCE(ds.name, 'NA') AS name, COUNT(ds.id) AS count
FROM debates d
LEFT JOIN debate_groups AS dg ON d.id = dg.debate_id
LEFT JOIN debate_stakeholders AS ds ON dg.stakeholder_id = ds.id
GROUP BY d.id, dg.stakeholder_id

mySQL - How to do this query?

I'm trying to answer the following query:
Select the first name and last name of the clients which rent films (that have DVD's) from all the categories, ordering by first name and last name.
The database consists of:
Inventory -> DVD's
Rental -> Rents customers did
Category table:
| Field | Type | Null | Key | Default | Extra |
| category_id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| name | varchar(25) | YES | | NULL | |
What I don't know is how to require that a result from one query must cover all the ids returned by another query (the categories).
I mean I understand the fact we can natural join inventory with rental and film, and then find an id that fails on a single category, then we know he doesn't contain all... But I can't complete this.
I have this solution (But I can't understand it very well):
SELECT first_name, last_name
FROM customer AS C WHERE NOT EXISTS
(SELECT * FROM category AS K WHERE NOT EXISTS
(SELECT * FROM (film NATURAL JOIN inventory) NATURAL JOIN rental
WHERE C.customer_id = customer_id AND K.category_id = category_id));
Are there any other solutions?
On our projects, we NEVER use NATURAL JOIN. That doesn't work for us, because the PRIMARY KEY is always a surrogate column named id, and the foreign key columns are always tablename_id.
A natural join would match id in one table to id in the other table, and that's not what we want. We also frequently have "housekeeping" columns in the tables that are named the same, such as version column used for optimistic locking pattern.
And even if our naming conventions were different, and the join columns were named the same, there would be a potential for a join in an existing query to change if we added a column to a table that was named the same as a column in another table.
And, reading a SQL statement that includes a NATURAL JOIN, we can't see what columns are actually being matched without running through the table definitions, looking for columns that are named the same. That seems to put an unnecessary burden on the reader of the statement. (A SQL statement is going to be "read" many more times than it's written... the author of the statement saving keystrokes isn't a beneficial tradeoff for ambiguity leading to extra work by future readers.)
(I know others have different opinions on this topic. I'm sure that successful software can be written using the NATURAL JOIN pattern. I'm just not smart enough or good enough to work with that. I'll give significant weight to the opinions of DBAs that have years of experience with database modeling, implementing schemas, writing and tuning SQL, supporting operational systems, and dealing with evolving requirements and ongoing maintenance.)
Where was I... oh yes... back to regularly scheduled programming...
The image of the schema is way too small for me to decipher, and I can't seem to copy any text from it. Output from a SHOW CREATE TABLE is much easier to work with.
Did you have a SQL Fiddle setup?
I don't think the query in the question will actually work. I thought there was a limitation on how far "up" a correlated subquery could reference an outer query.
To me, it looks like this predicate
WHERE C.customer_id = customer_id
^^^^^^^^^^^^^
is too deep. The subquery it's in isn't allowed to reference columns from C; that table is too high up. (Maybe I'm totally wrong about that; maybe it's Oracle or SQL Server or Teradata that has that restriction. Or maybe MySQL used to have that restriction, but a later version has lifted it.)
OTHER APPROACHES
As another approach, we could get each customer and a distinct list of every category that he's rented from.
Then, we could compare that list of "customer rented category" with a complete list of (distinct) category. One fairly easy way to do that would be to collapse each list into a "count" of distinct category, and then compare the counts. If a count for a customer is less than the total count, then we know he's not rented from every category. (There's a caveat: we need to ensure that the customer "rented from category" list contains only categories in the total category list.)
Another approach would be to take a list of (distinct) customer, and perform a cross join (cartesian product) with every possible category. (WARNING: this could be fairly large set.)
With that set of "customer cross product category", we could then eliminate rows where the customer has rented from that category (probably using an anti-join pattern.)
That would leave us with a set of customers and the categories they haven't rented from.
OP hasn't set up a SQL Fiddle with tables and exemplar data; so, I'm not going to bother doing it either.
I would offer some example SQL statements, but the table definitions from the image are unusable; to demonstrate those statements actually working, I'd need some exemplar data in the tables.
(Again, I don't believe the statement in the question actually works. There's no demonstration that it does work.)
I'd be more inclined to test it myself, if it weren't for the NATURAL JOIN syntax. I'm not smart enough to figure that out, without usable table definitions.
If I worked on that, the first thing I would do would be to re-write it to remove the NATURAL keyword, and add actual predicates in an actual ON clause, and qualify all of the column references.
And the query would end up looking something like this:
SELECT c.first_name
, c.last_name
FROM customer c
WHERE NOT EXISTS
( SELECT 1
FROM category k
WHERE NOT EXISTS
( SELECT 1
FROM film f
JOIN inventory i
ON i.film_id = f.film_id
JOIN rental r
ON r.inventory_id = i.inventory_id
WHERE f.category_id = k.category_id
AND r.customer_id = c.customer_id
)
)
(I think that reference to c.customer_id is too deep to be valid.)
EDIT
I stand corrected on my conjecture that the reference to C.customer_id was too many levels "deep". That query doesn't throw an error for me.
But it also doesn't seem to return the resultset that we're expecting, I may have screwed it up somehow. Oh well.
Here's an example of getting the "count of distinct rental category" for each customer (GROUP BY c.customer_id, just in case we have two customers with the same first and last names) and comparing to the count of category.
SELECT c.last_name
, c.first_name
FROM customer c
JOIN rental r
ON r.customer_id = c.customer_id
JOIN inventory i
ON i.inventory_id = r.inventory_id
JOIN film f
ON f.film_id = i.film_id
GROUP
BY c.last_name
, c.first_name
, c.customer_id
HAVING COUNT(DISTINCT f.category_id)
= (SELECT COUNT(DISTINCT a.category_id) FROM category a)
ORDER
BY c.last_name
, c.first_name
, c.customer_id
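For anyone who wants to experiment, here is a minimal sketch that runs both the double NOT EXISTS form and the count-comparison form on a toy schema through Python's sqlite3 (SQLite standing in for MySQL). The table definitions are assumptions: in particular, film is given a category_id column, as the queries above presume, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category (category_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE film (film_id INTEGER PRIMARY KEY, category_id INTEGER);
CREATE TABLE inventory (inventory_id INTEGER PRIMARY KEY, film_id INTEGER);
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY,
                       first_name TEXT, last_name TEXT);
CREATE TABLE rental (rental_id INTEGER PRIMARY KEY,
                     inventory_id INTEGER, customer_id INTEGER);
INSERT INTO category VALUES (1, 'Action'), (2, 'Comedy');
INSERT INTO film VALUES (1, 1), (2, 2);
INSERT INTO inventory VALUES (1, 1), (2, 2);
INSERT INTO customer VALUES (1, 'Ann', 'Lee'), (2, 'Bob', 'Ray'), (3, 'Cy', 'Orr');
-- Ann rents from both categories (Action twice), Bob only Action, Cy nothing
INSERT INTO rental VALUES (1, 1, 1), (2, 2, 1), (3, 1, 1), (4, 1, 2);
""")

# "There is no category for which this customer has no rental."
not_exists = conn.execute("""
SELECT c.first_name, c.last_name
FROM customer c
WHERE NOT EXISTS (
    SELECT 1 FROM category k
    WHERE NOT EXISTS (
        SELECT 1
        FROM rental r
        JOIN inventory i ON i.inventory_id = r.inventory_id
        JOIN film f ON f.film_id = i.film_id
        WHERE f.category_id = k.category_id
          AND r.customer_id = c.customer_id
    )
)
ORDER BY c.first_name, c.last_name
""").fetchall()

# "The customer's distinct rented categories number as many as all categories."
by_count = conn.execute("""
SELECT c.first_name, c.last_name
FROM customer c
JOIN rental r ON r.customer_id = c.customer_id
JOIN inventory i ON i.inventory_id = r.inventory_id
JOIN film f ON f.film_id = i.film_id
GROUP BY c.customer_id, c.first_name, c.last_name
HAVING COUNT(DISTINCT f.category_id) = (SELECT COUNT(*) FROM category)
ORDER BY c.first_name, c.last_name
""").fetchall()

print(not_exists)  # [('Ann', 'Lee')]
print(by_count)    # [('Ann', 'Lee')]
```

On this toy data the two forms agree. They differ when the category table is empty: the NOT EXISTS form then returns every customer, while the count form returns only customers with at least one rental.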
EDIT
And here's a demonstration of the other approach, generating a cartesian product of all customers and all categories (WARNING: do NOT do this on LARGE sets!), and find out if any of those rows don't have a match.
-- customers who have rented from EVERY category
-- h = cartesian (cross) product of all customers with all categories
-- g = all categories rented by each customer
-- perform outer join, return all rows from h and matching rows from g
-- if a row from h does not have a "matching" row found in g
-- columns from g will be null, test if any rows have null values from g
SELECT h.last_name
, h.first_name
FROM ( SELECT hi.customer_id
, hi.last_name
, hi.first_name
, hj.category_id
FROM customer hi
CROSS
JOIN category hj
) h
LEFT
JOIN ( SELECT c.customer_id
, f.category_id
FROM customer c
JOIN rental r
ON r.customer_id = c.customer_id
JOIN inventory i
ON i.inventory_id = r.inventory_id
JOIN film f
ON f.film_id = i.film_id
GROUP
BY c.customer_id
, f.category_id
) g
ON g.customer_id = h.customer_id
AND g.category_id = h.category_id
GROUP
BY h.last_name
, h.first_name
, h.customer_id
HAVING MIN(g.category_id IS NOT NULL)
ORDER
BY h.last_name
, h.first_name
, h.customer_id
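A condensed sketch of the cross-product idea, again through Python's sqlite3 on an invented toy schema (film.category_id and all sample rows are assumptions). The first query lists the customer/category pairs with no rental, which is the anti-join; the second keeps only customers for whom no pair is missing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category (category_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE film (film_id INTEGER PRIMARY KEY, category_id INTEGER);
CREATE TABLE inventory (inventory_id INTEGER PRIMARY KEY, film_id INTEGER);
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY,
                       first_name TEXT, last_name TEXT);
CREATE TABLE rental (rental_id INTEGER PRIMARY KEY,
                     inventory_id INTEGER, customer_id INTEGER);
INSERT INTO category VALUES (1, 'Action'), (2, 'Comedy');
INSERT INTO film VALUES (1, 1), (2, 2);
INSERT INTO inventory VALUES (1, 1), (2, 2);
INSERT INTO customer VALUES (1, 'Ann', 'Lee'), (2, 'Bob', 'Ray'), (3, 'Cy', 'Orr');
INSERT INTO rental VALUES (1, 1, 1), (2, 2, 1), (3, 1, 2);
""")

# g = distinct (customer, category) pairs that actually have a rental
rented = """
SELECT DISTINCT r.customer_id, f.category_id
FROM rental r
JOIN inventory i ON i.inventory_id = r.inventory_id
JOIN film f ON f.film_id = i.film_id
"""

# Anti-join: pairs from the cross product that have no match in g.
missing = conn.execute(f"""
SELECT c.first_name, k.name
FROM customer c
CROSS JOIN category k
LEFT JOIN ({rented}) g
  ON g.customer_id = c.customer_id AND g.category_id = k.category_id
WHERE g.customer_id IS NULL
ORDER BY c.customer_id, k.category_id
""").fetchall()

# Customers with no missing pair: the smallest "matched" flag must be 1.
rented_all = conn.execute(f"""
SELECT c.last_name, c.first_name
FROM customer c
CROSS JOIN category k
LEFT JOIN ({rented}) g
  ON g.customer_id = c.customer_id AND g.category_id = k.category_id
GROUP BY c.last_name, c.first_name, c.customer_id
HAVING MIN(g.category_id IS NOT NULL)
ORDER BY c.last_name, c.first_name, c.customer_id
""").fetchall()

print(missing)     # [('Bob', 'Comedy'), ('Cy', 'Action'), ('Cy', 'Comedy')]
print(rented_all)  # [('Lee', 'Ann')]
```

The cross product has one row per customer per category, so the warning about large sets applies to this sketch as well.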
I will take a stab at this, only because I am curious why the answer proposed seems so complex. First, a couple of questions.
So your question is: "Select the first name and last name of the clients which rent films (that have DVD's) from all the categories, ordering by first name and last name."
So, just go through the rental table, joining customer. I am not sure what the category part has to do with this: you are not selecting or displaying any category, so it does not need to be part of the search. It is implied, since when they rent a DVD, that DVD has a category.
SELECT C.first_name, C.last_name
FROM customer as C JOIN rental as R
ON (C.customer_id = R.customer_id)
WHERE R.return_date IS NULL;
So, you are looking for movies that are currently rented (no return date yet), and displaying the first and last names of customers with active rentals.
You can also use SELECT DISTINCT to remove the duplicate customers that show up in the list.
Does this help?!

Best way to structure SQL queries with many inner joins?

I have an SQL query that needs to perform multiple inner joins, as follows:
SELECT DISTINCT adv.Email, adv.Credit, c.credit_id AS creditId, c.creditName AS creditName, a.Ad_id AS adId, a.adName
FROM placementlist pl
INNER JOIN
(SELECT Ad_id, List_id FROM placements) AS p
ON pl.List_id = p.List_id
INNER JOIN
(SELECT Ad_id, Name AS adName, credit_id FROM ad) AS a
ON ...
(few more inner joins)
My question is the following: How can I optimize this query? I was under the impression that, even though the way I currently query the database creates small temporary tables (inner SELECT statements), it would still be advantageous compared to performing an inner join on the unaltered tables, as they could have about 10,000 - 100,000 entries (not millions). However, I was told that this is not the best way to go about it, and I did not have the opportunity to ask what the recommended approach would be.
What would be the best approach here?
To use derived tables such as
INNER JOIN (SELECT Ad_id, List_id FROM placements) AS p
is not recommended. Let the dbms find out by itself what values it needs from
INNER JOIN placements AS p
instead of telling it (again), in effect forcing it to create a view on the table with just those two columns. (And a plain FROM tablename is also much more readable.)
With SQL you mainly say what you want to see, not how this is going to be achieved. (Well, of course this is just a rule of thumb.) So if no other columns except Ad_id and List_id are used from table placements, the dbms will find its best way to handle this. Don't try to make it use your way.
The same is true of the IN clause, by the way, where you often see WHERE col IN (SELECT DISTINCT colx FROM ...) instead of simply WHERE col IN (SELECT colx FROM ...). This does exactly the same, but with DISTINCT you tell the dbms "make your subquery's rows distinct before looking for col". But why would you want to force it to do so? Why not have it use just the method the dbms finds most appropriate?
Back to derived tables: Use them when they really do something, especially aggregations, or when they make your query more readable.
Moreover,
SELECT DISTINCT adv.Email, adv.Credit, ...
doesn't look too good either. Yes, sometimes you need SELECT DISTINCT, but usually you don't. Most often it is just a sign that you haven't thought your query through.
An example: you want to select clients that bought product X. In SQL you would say: where a purchase of X EXISTS for the client. Or: where the client is IN the set of the X purchasers.
select * from clients c where exists
(select * from purchases p where p.clientid = c.clientid and product = 'X');
Or
select * from clients where clientid in
(select clientid from purchases where product = 'X');
You don't say: Give me all combinations of clients and X purchases and then boil that down so I just get each client once.
select distinct c.*
from clients c
join purchases p on p.clientid = c.clientid and product = 'X';
Yes, it is very easy to just join all tables needed and then just list the columns to select and then just put DISTINCT in front. But it makes the query kind of blurry, because you don't write the query as you would word the task. And it can make things difficult when it comes to aggregations. The following query is wrong, because you multiply money earned with the number of money-spent records and vice versa.
select
sum(money_spent.value),
sum(money_earned.value)
from user
join money_spent on money_spent.userid = user.userid
join money_earned on money_earned.userid = user.userid;
And the following may look correct, but is still incorrect (it only works when the values happen to be unique):
select
sum(distinct money_spent.value),
sum(distinct money_earned.value)
from user
join money_spent on money_spent.userid = user.userid
join money_earned on money_earned.userid = user.userid;
Again: You would not say: "I want to combine each purchase with each earning and then ...". You would say: "I want the sum of money spent and the sum of money earned per user". So you are not dealing with single purchases or earnings, but with their sums. As in
select
(select sum(value) from money_spent where money_spent.userid = user.userid) as spent,
(select sum(value) from money_earned where money_earned.userid = user.userid) as earned
from user;
Or:
select
spent.total,
earned.total
from user
join (select userid, sum(value) as total from money_spent group by userid) spent
on spent.userid = user.userid
join (select userid, sum(value) as total from money_earned group by userid) earned
on earned.userid = user.userid;
So you see, this is where derived tables come into play.
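To make the arithmetic concrete, here is a small sketch run through Python's sqlite3 as a stand-in for the dbms. The table layouts and the values are invented: one user who spent 10 twice (20 in total) and earned 5, 5 and 7 (17 in total).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user (userid INTEGER PRIMARY KEY);
CREATE TABLE money_spent (userid INTEGER, value INTEGER);
CREATE TABLE money_earned (userid INTEGER, value INTEGER);
INSERT INTO user VALUES (1);
INSERT INTO money_spent VALUES (1, 10), (1, 10);         -- 20 spent
INSERT INTO money_earned VALUES (1, 5), (1, 5), (1, 7);  -- 17 earned
""")

joins = """
FROM user
JOIN money_spent ON money_spent.userid = user.userid
JOIN money_earned ON money_earned.userid = user.userid
"""

# 2 spent rows x 3 earned rows = 6 joined rows, so both sums are inflated.
joined = conn.execute(
    "SELECT SUM(money_spent.value), SUM(money_earned.value)" + joins
).fetchone()

# DISTINCT collapses equal values, so the repeated 10 and 5 are lost.
distinct = conn.execute(
    "SELECT SUM(DISTINCT money_spent.value), SUM(DISTINCT money_earned.value)" + joins
).fetchone()

# Aggregate each table first, then join the per-user totals.
preagg = conn.execute("""
SELECT spent.total, earned.total
FROM user
JOIN (SELECT userid, SUM(value) AS total
      FROM money_spent GROUP BY userid) spent ON spent.userid = user.userid
JOIN (SELECT userid, SUM(value) AS total
      FROM money_earned GROUP BY userid) earned ON earned.userid = user.userid
""").fetchone()

print(joined)    # (60, 34)
print(distinct)  # (10, 12)
print(preagg)    # (20, 17)
```

Only the pre-aggregated form returns the real totals; the other two are wrong in opposite directions.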

MySQL - 3 tables, is this complex join even possible?

I have three tables: users, groups and relation.
Table users with fields: usrID, usrName, usrPass, usrPts
Table groups with fields: grpID, grpName, grpMinPts
Table relation with fields: uID, gID
User can be placed in group in two ways:
if he collects the group's minimum number of points (users.usrPts > groups.grpMinPts ORDER BY groups.grpMinPts DESC LIMIT 1)
if his relation to the group is manually added in the relation table (user ID provided as uID, as well as group ID provided as gID in the table named relation)
Can I create one single query, to determine for every user (or one specific), which group he belongs, but, manual relation (using relation table) should have higher priority than usrPts compared to grpMinPts? Also, I do not want to have one user shown twice (to show his real group by points, but related group also)...
Thanks in advance! :) I tried:
SELECT * FROM users LEFT JOIN (relation LEFT JOIN groups ON relation.gID = groups.grpID) ON users.usrID = relation.uID
Using this I managed to extract specified relations (from relation table), but, I have no idea how to include user points, respecting above mentioned priority (specified first). I know how to do this in a few separated queries in php, that is simple, but I am curious, can it be done using one single query?
EDIT TO ADD:
Thanks to the really educational technique using coalesce that @GordonLinoff provided, I managed to make this query work as I expected. So, here it goes:
SELECT o.usrID, o.usrName, o.usrPass, o.usrPts, t.grpID, t.grpName
FROM (
SELECT u.*, COALESCE(relationgroupid,groupid) AS thegroupid
FROM (
SELECT u.*, (
SELECT grpID
FROM groups g
WHERE u.usrPts > g.grpMinPts
ORDER BY g.grpMinPts DESC
LIMIT 1
) AS groupid, (
SELECT gID
FROM relation r
WHERE r.uID = u.usrID
) AS relationgroupid
FROM users u
)u
)o
JOIN groups t ON t.grpID = o.thegroupid
Also, if you are wondering, as I did, whether this approach is faster or slower than running three queries and processing the results in PHP: it is slightly faster. The average time to execute this query and show the results on a web page was 14 ms, versus 21 ms for three simple queries processed in PHP. The averages are based on 10 runs, and the execution time was essentially constant.
Here is an approach that uses correlated subqueries to get each of the values. It then chooses the appropriate one using the precedence rule that if the relations exist use that one, otherwise use the one from the groups table:
select u.*,
coalesce(relationgroupid, groupid) as thegroupid
from (select u.*,
(select grpid from groups g where u.usrPts > g.grpMinPts order by g.grpMinPts desc limit 1
) as groupid,
(select gID from relation r where r.uID = u.usrID
) as relationgroupid
from users u
) u
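Here is a minimal sketch of the COALESCE precedence rule, run through Python's sqlite3 instead of MySQL. The sample groups, users and the single manual relation are invented; table and column names follow the question. The table name groups is double-quoted in the sketch as a precaution, since GROUPS is a keyword in some engines (as far as I know MySQL 8.0 reserves it, so backticks would be needed there).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (usrID INTEGER PRIMARY KEY, usrName TEXT, usrPts INTEGER);
CREATE TABLE "groups" (grpID INTEGER PRIMARY KEY, grpName TEXT, grpMinPts INTEGER);
CREATE TABLE relation (uID INTEGER, gID INTEGER);
-- invented sample data
INSERT INTO "groups" VALUES (1, 'Bronze', 0), (2, 'Silver', 100), (3, 'Gold', 500);
INSERT INTO users VALUES (1, 'ann', 150), (2, 'bob', 50), (3, 'cy', 700);
INSERT INTO relation VALUES (2, 3);  -- bob is manually placed in Gold
""")

rows = conn.execute("""
SELECT x.usrName, t.grpName
FROM (
    SELECT u.usrID, u.usrName,
           COALESCE(
               -- manual relation wins when it exists
               (SELECT r.gID FROM relation r
                 WHERE r.uID = u.usrID LIMIT 1),
               -- otherwise the highest group whose minimum the user exceeds
               (SELECT g.grpID FROM "groups" g
                 WHERE u.usrPts > g.grpMinPts
                 ORDER BY g.grpMinPts DESC LIMIT 1)
           ) AS thegroupid
    FROM users u
) x
JOIN "groups" t ON t.grpID = x.thegroupid
ORDER BY x.usrID
""").fetchall()

print(rows)  # [('ann', 'Silver'), ('bob', 'Gold'), ('cy', 'Gold')]
```

Each user appears once: bob's 50 points would put him in Bronze, but the manual relation takes priority. Because the final join is an inner join, a user who qualifies for no group at all would drop out of the result; a LEFT JOIN would keep such users.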
Try something like this
select user.name, group.name
from group
join relation on relation.gid = group.gid
join user on user.uid = relation.uid
union
select user.name, g1.name
from group g1
join group g2 on g2.minpts > g1.minpts
join user on user.pts between g1.minpts and g2.minpts