I would like to calculate the Click-Through Ratio (CTR) of several articles of a website using SQL.
The formula of the CTR is CTR = number clicks / number impressions, i.e. a ratio of how many times an article has been clicked and how many times it has been shown.
I have two tables:
´article_click´: A table with several columns, namely ´article_id´ (denoting the id of the article), ´description´ (a brief description of the article), ´timestamp´ (when it has been clicked), among others. Every time a user clicks an article, a new row is created in the table.
´article_impression´: Similarly, a table with several columns, namely ´article_id´ (denoting the id of the article), ´description´ (a brief description of the article), ´timestamp´ (when it has been shown), among others. Every time an article is shown to a user, a new row is created in the table.
Both tables 1 and 2 look like this:
+------------+-------------+------------------+-----+
| article_id | description | timestamp | ... |
+------------+-------------+------------------+-----+
| 102 | Potatoe | 2021-01-01 13:45 | ... |
| 11 | Lettuce | 2020-02-11 11:00 | ... |
| 34 | Train | 2019-12-12 09:31 | ... |
| 21 | Car | 2011-11-11 08:32 | ... |
| 201 | Train | 2014-02-10 02:12 | ... |
| ... | ... | ... | ... |
+------------+-------------+------------------+-----+
And I would like to create a table such that:
+------------+-----+
| article_id | CTR |
+------------+-----+
| 11 | 0.4 |
| 23 | 0.6 |
| 34 | 0.2 |
| 44 | 0.8 |
| 45 | 0.3 |
| ... | ... |
+------------+-----+
In order to do so, I have tried:
SELECT article_click.article_id, COUNT(article_click.article_id) / COUNT(article_impression.article_id) AS CTR
FROM article_click
INNER JOIN article_impression ON article_click.article_id = article_impression.article_id
GROUP BY article_click.article_id DESC;
But I obtain something like:
+------------+-----+
| article_id | CTR |
+------------+-----+
| 11 | 1.0 |
| 23 | 1.0 |
| 34 | 1.0 |
| 44 | 1.0 |
| 45 | 1.0 |
| ... | ... |
+------------+-----+
Can anyone spot the mistake here? I'm using MySQL as RDBMS.
If the click-through-rate (CTR) is number clicks / number impressions then you'll need to calculate the number of clicks on an article and the number of impressions on an article before joining them to perform the calculation.
You could do this with subqueries or CTEs, but I've opted for the former here.
SELECT c.article_id, c.click_count / i.impression_count AS CTR
FROM (
SELECT article_id, COUNT(*) AS click_count
FROM article_click
GROUP BY article_id) AS c
INNER JOIN (
SELECT article_id, COUNT(*) AS impression_count
FROM article_impression
GROUP BY article_id) AS i
ON c.article_id = i.article_id;
Note that using an INNER JOIN will exclude articles that have impressions but no clicks, so you won't get results where the CTR is 0. If you want those, you can use a LEFT JOIN from impressions to clicks. Since an article cannot be clicked if it has not been shown, we know that a LEFT JOIN from impressions to clicks is sufficient to show all data.
SELECT i.article_id, COALESCE(c.click_count, 0) / i.impression_count AS CTR
FROM (
SELECT article_id, COUNT(*) AS impression_count
FROM article_impression
GROUP BY article_id) AS i
LEFT JOIN (
SELECT article_id, COUNT(*) AS click_count
FROM article_click
GROUP BY article_id) AS c
ON i.article_id = c.article_id;
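Note that we have to use the article_id from article_impression, since the article_id coming from article_click will be NULL for articles that were never clicked. For the same reason, we have to COALESCE the click_count: dividing NULL by the impression count does not raise an error, it simply yields NULL, and we want a CTR of 0 for those articles instead.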
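To see why the original query always returned 1.0, here is a minimal sketch that runs both variants against a tiny in-memory database. It uses Python's sqlite3 module as a stand-in for MySQL, and the sample rows are made up for illustration. SQLite divides integers as integers, hence the `1.0 *` in the ratio; MySQL's `/` does not need it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE article_click (article_id INTEGER);
    CREATE TABLE article_impression (article_id INTEGER);
    -- Illustrative data: article 11 has 2 clicks / 5 impressions,
    -- article 34 has 1 click / 5 impressions, article 45 has 0 clicks / 2 impressions.
    INSERT INTO article_click VALUES (11), (11), (34);
    INSERT INTO article_impression VALUES (11), (11), (11), (11), (11),
                                          (34), (34), (34), (34), (34),
                                          (45), (45);
""")

# Joining the raw tables pairs every click with every impression of the same
# article, so both COUNTs see clicks * impressions rows and their ratio is 1.
naive = conn.execute("""
    SELECT c.article_id,
           COUNT(c.article_id) AS clicks_seen,
           COUNT(i.article_id) AS impressions_seen
    FROM article_click AS c
    INNER JOIN article_impression AS i ON c.article_id = i.article_id
    GROUP BY c.article_id
    ORDER BY c.article_id
""").fetchall()

# Aggregating each table first leaves one row per article on each side.
fixed = conn.execute("""
    SELECT i.article_id,
           1.0 * COALESCE(c.click_count, 0) / i.impression_count AS ctr
    FROM (SELECT article_id, COUNT(*) AS impression_count
          FROM article_impression GROUP BY article_id) AS i
    LEFT JOIN (SELECT article_id, COUNT(*) AS click_count
               FROM article_click GROUP BY article_id) AS c
      ON i.article_id = c.article_id
    ORDER BY i.article_id
""").fetchall()

print(naive)
print(fixed)
```

In the naive result both counts are identical for every article, which is exactly the 1.0 seen in the question.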
Avoid joining the raw tables, because the join duplicates rows. Compute the count for each table separately, then join the two aggregated results.
select c.article_id, c.article_click / i.article_impression as ctr
from (select article_id, count(article_id) as article_click
      from article_click
      group by article_id) c
inner join (select article_id, count(article_id) as article_impression
            from article_impression
            group by article_id) i
  on c.article_id = i.article_id
WITH
v_article AS
( SELECT 'S' type, article_impression.article_id FROM article_impression
  UNION ALL
  SELECT 'C' type, article_click.article_id FROM article_click
)
SELECT
  v_article.article_id,
  COUNT(CASE WHEN v_article.type = 'S' THEN 1 END) nb_show,
  COUNT(CASE WHEN v_article.type = 'C' THEN 1 END) nb_click,
  CASE
    WHEN COUNT(CASE WHEN v_article.type = 'S' THEN 1 END) > 0 THEN
      ROUND(COUNT(CASE WHEN v_article.type = 'C' THEN 1 END) / COUNT(CASE WHEN v_article.type = 'S' THEN 1 END), 2)
  END ratio_click_show
FROM v_article
GROUP BY
  v_article.article_id
;
If you're sure an article can be clicked only if it has previously been shown (nb_show > 0 and nb_show >= nb_click), you can remove the CASE around the ratio calculation.
I'm very average with MySQL, but usually I can write all the needed queries after reading documentation and searching for examples. Now, I'm in the situation where I spent 3 days re-searching and re-writing queries, but I can't get it to work the exact way I need. Here's the deal:
1st table (mpt_companies) contains companies:
| company_id | company_title |
------------------------------
| 1 | Company A |
| 2 | Company B |
2nd table (mpt_payment_methods) contains payment methods:
| payment_method_id | payment_method_title |
--------------------------------------------
| 1 | Cash |
| 2 | PayPal |
| 3 | Wire |
3rd table (mpt_payments) contains payments for each company:
| payment_id | company_id | payment_method_id | payment_amount |
----------------------------------------------------------------
| 1 | 1 | 1 | 10.00 |
| 2 | 2 | 3 | 15.00 |
| 3 | 1 | 1 | 20.00 |
| 4 | 1 | 2 | 10.00 |
I need to list each company along with many stats. One of stats is the sum of payments in each payment method. In other words, the result should be:
| company_id | company_title | payment_data |
--------------------------------------------------------
| 1 | Company A | Cash:30.00,PayPal:10.00 |
| 2 | Company B | Wire:15.00 |
Obviously, I need to:
Select all the companies;
Join payments for each company;
Join payment methods for each payment;
Calculate sum of payments in each method;
GROUP_CONCAT payment methods and sums;
Unfortunately, SUM() doesn't work inside GROUP_CONCAT. Some solutions I found on this site suggest using CONCAT, but that doesn't produce the list I need. Other solutions suggest using CAST(), but maybe I'm doing something wrong, because that doesn't work either. This is the closest query I wrote; it returns each company and the unique list of payment methods used by each company, but not the sum of payments:
SELECT *,
(some other sub-queries I need...),
(SELECT GROUP_CONCAT(DISTINCT(mpt_payment_methods.payment_method_title))
FROM mpt_payments
JOIN mpt_payment_methods
ON mpt_payments.payment_method_id=mpt_payment_methods.payment_method_id
WHERE mpt_payments.company_id=mpt_companies.company_id
ORDER BY mpt_payment_methods.payment_method_title) AS payment_data
FROM mpt_companies
Then I tried:
SELECT *,
(some other sub-queries I need...),
(SELECT GROUP_CONCAT(DISTINCT(mpt_payment_methods.payment_method_title), ':', CAST(SUM(mpt_payments.payment_amount) AS CHAR))
FROM mpt_payments
JOIN mpt_payment_methods
ON mpt_payments.payment_method_id=mpt_payment_methods.payment_method_id
WHERE mpt_payments.company_id=mpt_companies.company_id
ORDER BY mpt_payment_methods.payment_method_title) AS payment_data
FROM mpt_companies
...and many other variations, but all of them either returned query errors or didn't return/format the data I need.
The closest answer I could find was MySQL one to many relationship: GROUP_CONCAT or JOIN or both? but after spending 2 hours re-writing the provided query to work with my data, I couldn't do it.
Could anyone give me a suggestion, please?
You can do that by aggregating twice. First for the sum of payments per method and company and then to concatenate the sums for each company.
SELECT x.company_id,
x.company_title,
group_concat(payment_amount_and_method) payment_data
FROM (SELECT c.company_id,
c.company_title,
concat(pm.payment_method_title, ':', sum(p.payment_amount)) payment_amount_and_method
FROM mpt_companies c
INNER JOIN mpt_payments p
ON p.company_id = c.company_id
INNER JOIN mpt_payment_methods pm
ON pm.payment_method_id = p.payment_method_id
GROUP BY c.company_id,
c.company_title,
pm.payment_method_id,
pm.payment_method_title) x
GROUP BY x.company_id,
x.company_title;
Here you go
SELECT company_id,
company_title,
GROUP_CONCAT(
CONCAT(payment_method_title, ':', payment_amount)
) AS payment_data
FROM (
SELECT c.company_id, c.company_title, pm.payment_method_id, pm.payment_method_title, SUM(p.payment_amount) AS payment_amount
FROM mpt_payments p
JOIN mpt_companies c ON p.company_id = c.company_id
JOIN mpt_payment_methods pm ON pm.payment_method_id = p.payment_method_id
GROUP BY p.company_id, p.payment_method_id
) distinct_company_payments
GROUP BY distinct_company_payments.company_id
;
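The "aggregate twice" idea from the answers above can be checked end to end with a small sketch. It runs on SQLite through Python's sqlite3 module rather than on MySQL, so two details are assumptions of this sketch and not part of the answers: `||` replaces CONCAT, and `printf('%.2f', ...)` formats the sums. The data is the sample data from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mpt_companies (company_id INTEGER, company_title TEXT);
    CREATE TABLE mpt_payment_methods (payment_method_id INTEGER, payment_method_title TEXT);
    CREATE TABLE mpt_payments (payment_id INTEGER, company_id INTEGER,
                               payment_method_id INTEGER, payment_amount REAL);
    INSERT INTO mpt_companies VALUES (1, 'Company A'), (2, 'Company B');
    INSERT INTO mpt_payment_methods VALUES (1, 'Cash'), (2, 'PayPal'), (3, 'Wire');
    INSERT INTO mpt_payments VALUES (1, 1, 1, 10.00), (2, 2, 3, 15.00),
                                    (3, 1, 1, 20.00), (4, 1, 2, 10.00);
""")

rows = conn.execute("""
    SELECT x.company_id, x.company_title,
           GROUP_CONCAT(x.method_and_sum) AS payment_data
    FROM (
        -- First aggregation: one row per company and payment method.
        SELECT c.company_id, c.company_title,
               pm.payment_method_title || ':' ||
                   printf('%.2f', SUM(p.payment_amount)) AS method_and_sum
        FROM mpt_companies AS c
        JOIN mpt_payments AS p ON p.company_id = c.company_id
        JOIN mpt_payment_methods AS pm ON pm.payment_method_id = p.payment_method_id
        GROUP BY c.company_id, c.company_title,
                 pm.payment_method_id, pm.payment_method_title
    ) AS x
    -- Second aggregation: concatenate the per-method strings per company.
    GROUP BY x.company_id, x.company_title
    ORDER BY x.company_id
""").fetchall()

# SQLite does not guarantee the order of items inside GROUP_CONCAT,
# so the items are sorted here before being compared or displayed.
payment_data = {title: sorted(data.split(",")) for _, title, data in rows}
print(payment_data)
```

In MySQL the item order can be fixed inside the aggregate itself with `GROUP_CONCAT(... ORDER BY ...)`.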
I'm not even sure what the title of this question should be, but let's start with my data.
I have a table of users who have taken a few lessons while belonging to a particular training center.
lesson table
id | lesson_id | user_id | has_completed
----------------------------------------
1 | asdf3314 | 2 | 1
2 | d13saf12 | 2 | 1
3 | a33adff5 | 2 | 0
4 | a33adff5 | 1 | 1
5 | d13saf12 | 1 | 0
user table
id | center_id | ...
----------------------------------------
1 | 20 | ...
2 | 30 | ...
training center table
id | center_name | ...
----------------------------------------
20 | learn.co | ...
30 | teach.co | ...
I've written a small chunk but am now stuck, as I don't know how to proceed. This statement counts the completed lessons per user. It then computes the average of those counts for a given center id. If two users belong to a center and have completed 3 lessons and 2 lessons, it finds the average of 3 and 2 and returns that.
SELECT
    FLOOR(AVG(a.total)) AS avg_completion
FROM
    (SELECT
        user_id,
        user.center_id,
        count(user_id) AS total
    FROM lesson
    LEFT JOIN user ON user.id = user_id
    WHERE has_completed = 1 AND center_id = 2
    GROUP BY user_id) AS a;
The question I have is: how do I loop through the training centers table and append the average from a select statement like the one above to each center that is returned? I can't seem to pass the center id down to the subquery, so there must be a fundamentally different way to write the query that also covers every training center.
An example of desired result:
center.id | avg_completion | ...training center table
-----------------------------------------------------
20 | 2 | ...
Your main query needs to select a.center_id and then use GROUP BY center_id. You can then join it with the training_center table.
SELECT c.*, x.avg_completion
FROM training_center AS c
JOIN (
    SELECT
        a.center_id,
        FLOOR(AVG(a.total)) AS avg_completion
    FROM (
        SELECT
            user_id,
            user.center_id,
            count(*) AS total
        FROM lesson
        JOIN user ON user.id = user_id
        WHERE has_completed = 1
        GROUP BY user_id, user.center_id) AS a
    GROUP BY a.center_id) AS x
ON x.center_id = c.id
If I understand correctly:
select u.center_id,
       count(distinct u.id) as num_users,
       sum(l.has_completed) as num_completed,
       avg(l.has_completed) as completed_ratio
from lesson l join
     user u
     on l.user_id = u.id
group by u.center_id
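As a quick check of the nested GROUP BY approach (count per user, then average per center, then join to the centers), here is a sketch using the sample rows from the question. It runs on SQLite through Python's sqlite3 module as a stand-in for MySQL. The table name training_center is an assumption, and CAST(... AS INTEGER) replaces FLOOR because FLOOR is not available in every SQLite build; for non-negative averages the two agree.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lesson (id INTEGER, lesson_id TEXT, user_id INTEGER, has_completed INTEGER);
    CREATE TABLE user (id INTEGER, center_id INTEGER);
    CREATE TABLE training_center (id INTEGER, center_name TEXT);
    INSERT INTO lesson VALUES (1, 'asdf3314', 2, 1), (2, 'd13saf12', 2, 1),
                              (3, 'a33adff5', 2, 0), (4, 'a33adff5', 1, 1),
                              (5, 'd13saf12', 1, 0);
    INSERT INTO user VALUES (1, 20), (2, 30);
    INSERT INTO training_center VALUES (20, 'learn.co'), (30, 'teach.co');
""")

rows = conn.execute("""
    SELECT c.id, c.center_name, x.avg_completion
    FROM training_center AS c
    JOIN (
        -- Outer level: average of the per-user counts within each center.
        SELECT a.center_id,
               CAST(AVG(a.total) AS INTEGER) AS avg_completion
        FROM (
            -- Inner level: completed lessons per user, carrying the center id along.
            SELECT l.user_id, u.center_id, COUNT(*) AS total
            FROM lesson AS l
            JOIN user AS u ON u.id = l.user_id
            WHERE l.has_completed = 1
            GROUP BY l.user_id, u.center_id
        ) AS a
        GROUP BY a.center_id
    ) AS x ON x.center_id = c.id
    ORDER BY c.id
""").fetchall()

print(rows)
```

Because of the WHERE filter, users with no completed lessons drop out of the inner query and do not pull the average down; whether that is wanted depends on how the average should be defined.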
I have two tables
Accounts:
+------------+--------+
| accountsid | name |
+------------+--------+
| 1 | Bob |
| 2 | Rachel |
| 3 | Mark |
+------------+--------+
Sales Orders
+--------------+------------+------------+--------+
| salesorderid | accountsid | so_date | amount |
+--------------+------------+------------+--------+
| 1 | 1 | 2015-12-16 | 50 |
| 2 | 1 | 2016-01-13 | 20 |
| 3 | 2 | 2015-12-14 | 10 |
| 4 | 3 | 2016-02-14 | 35 |
+--------------+------------+------------+--------+
As you can see, this is a 1-N relation: an Account has many Sales Orders, and each Sales Order belongs to one Account.
I need to retrieve "old" Accounts that are no longer active. For example, an Account that has no Sales Order in 2016 is an inactive Account.
So, in this example the result will be ONLY Rachel.
How can I retrieve this? I think it's the "opposite" of BETWEEN, but I can't figure out how to do it...
Thanks.
PS. Despite the title, a solution that does not use INNER JOIN is fine too.
You're looking to effect an anti-join, for which there are three possibilities in MySQL:
Using NOT IN:
SELECT a.*
FROM Accounts a
WHERE a.accountsid NOT IN (
SELECT so.accountsid
FROM `Sales Orders` so
WHERE so.so_date >= '2016-01-01'
)
Using NOT EXISTS:
SELECT a.*
FROM Accounts a
WHERE NOT EXISTS (
SELECT *
FROM `Sales Orders` so
WHERE so.accountsid = a.accountsid
AND so.so_date >= '2016-01-01'
)
Using an outer JOIN:
SELECT a.*
FROM Accounts a LEFT JOIN `Sales Orders` so
ON so.accountsid = a.accountsid
AND so.so_date >= '2016-01-01'
WHERE so.accountsid IS NULL
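The three anti-join forms can be compared side by side with a small sketch. It uses Python's sqlite3 module as a stand-in for MySQL and the sample rows from the question. The table is named sales_orders here (an assumption made to avoid quoting a name that contains a space), and dates are stored as ISO-formatted text, which compares correctly as strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (accountsid INTEGER, name TEXT);
    CREATE TABLE sales_orders (salesorderid INTEGER, accountsid INTEGER,
                               so_date TEXT, amount INTEGER);
    INSERT INTO accounts VALUES (1, 'Bob'), (2, 'Rachel'), (3, 'Mark');
    INSERT INTO sales_orders VALUES (1, 1, '2015-12-16', 50), (2, 1, '2016-01-13', 20),
                                    (3, 2, '2015-12-14', 10), (4, 3, '2016-02-14', 35);
""")

queries = {
    "not_in": """
        SELECT a.name FROM accounts AS a
        WHERE a.accountsid NOT IN (
            SELECT so.accountsid FROM sales_orders AS so
            WHERE so.so_date >= '2016-01-01')
    """,
    "not_exists": """
        SELECT a.name FROM accounts AS a
        WHERE NOT EXISTS (
            SELECT 1 FROM sales_orders AS so
            WHERE so.accountsid = a.accountsid
              AND so.so_date >= '2016-01-01')
    """,
    # The date filter lives in the ON clause, so accounts without a 2016 order
    # survive the LEFT JOIN with NULLs on the right and are kept by IS NULL.
    "left_join": """
        SELECT a.name FROM accounts AS a
        LEFT JOIN sales_orders AS so
          ON so.accountsid = a.accountsid
         AND so.so_date >= '2016-01-01'
        WHERE so.accountsid IS NULL
    """,
}

results = {label: [row[0] for row in conn.execute(sql)] for label, sql in queries.items()}
print(results)
```

One practical difference: if the subquery in the NOT IN form can return a NULL accountsid, NOT IN matches no rows at all, while the other two forms are unaffected.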
Why do you need to use only an inner join? An inner join is for cases where you have matching data in two tables, which is not the case here. You need a subquery with either NOT IN or NOT EXISTS.
What you want is the ids that didn't place any order, so get the ids that did place an order; the rest are the ones that didn't.
It should be something like this: SELECT * FROM Accounts WHERE accountsid NOT IN (SELECT accountsid FROM `Sales Orders` WHERE so_date > your_date)
Update #1: query gives me syntax error on Left Join line (running the query within the left join independently works perfectly though)
SELECT b1.company_id, ((sum(b1.credit)-sum(b1.debit)) as 'Balance'
FROM MyTable b1
JOIN CustomerInfoTable c on c.id = b1.company_id
#Filter for Clients of particular brand, package and active status
where c.brand_id = 2 and c.status = 2 and c.package_id = 3
LEFT JOIN
(
SELECT b2.company_id, sum(b2.debit) as 'Current_Usage'
FROM MyTable b2
WHERE year(b2.timestamp) = '2012' and month(b2.timestamp) = '06'
GROUP BY b2.company_id
)
b3 on b3.company_id = b1.company_id
group by b1.company_id;
Original Post:
I keep track of debits and credits in the same table. The table has the following schema:
| company_id | timestamp | credit | debit |
| 10 | MAY-25 | 100 | 000 |
| 11 | MAY-25 | 000 | 054 |
| 10 | MAY-28 | 000 | 040 |
| 12 | JUN-01 | 100 | 000 |
| 10 | JUN-25 | 150 | 000 |
| 10 | JUN-25 | 000 | 025 |
As my result, I want to see:
| Grouped by: company_id | Balance* | Current_Usage (in June) |
| 10 | 185 | 25 |
| 12 | 100 | 0 |
| 11 | -54 | 0 |
Balance: Calculated by (sum(credit) - sum(debits))* - timestamp does not matter
Current_Usage: Calculated by sum(debits) - but only for debits in JUN.
The problem: If I filter by JUN timestamp right away, it does not calculate the balance of all time but only the balance of any transactions in June.
How can I calculate the current usage by month but the balance on all transactions in the table. I have everything working, except that it filters only the JUN results into the current usage calculation in my code:
SELECT b.company_id, ((sum(b.credit)-sum(b.debit))/1024/1024/1024/1024) as 'BW_remaining', sum(b.debit/1024/1024/1024/1024/28*30) as 'Usage_per_month'
FROM mytable b
#How to filter this only for the current_usage calculation?
WHERE month(a.timestamp) = 'JUN' and a.credit = 0
#Group by company in order to sum all entries for balance
group by b.company_id
order by b.balance desc;
What you need here is a join with a subquery that filters by month.
SELECT T1.company_id,
       ((sum(T1.credit)-sum(T1.debit))/1024/1024/1024/1024) as 'BW_remaining',
       MAX(T3.DEBIT_PER_MONTH)
FROM MYTABLE T1
LEFT JOIN
(
    -- MONTH() returns a number, so June is 6 rather than 'JUN'
    SELECT T2.company_id, SUM(T2.debit) AS DEBIT_PER_MONTH
    FROM MYTABLE T2
    WHERE month(T2.timestamp) = 6
    GROUP BY T2.company_id
)
T3 ON T1.company_id = T3.company_id
GROUP BY T1.company_id
I haven't tested the query. The point I am trying to make is how you can join to your existing query to get the usage per month.
Alright, thanks to @Kshitij I got it working. In case somebody else runs into the same issue, this is how I solved it:
SELECT b1.company_id, (sum(b1.credit)-sum(b1.debit)) as 'Balance',
(
SELECT sum(b2.debit)
FROM MYTABLE b2
WHERE b2.company_id = b1.company_id and year(b2.timestamp) = '2012' and month(b2.timestamp) = '06'
GROUP BY b2.company_id
) AS 'Usage_June'
FROM MYTABLE b1
#Group by company in order to add sum of all zones the company is using
group by b1.company_id
order by Usage_June desc;
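A different technique that avoids the subquery altogether is conditional aggregation: the balance sums every row, while the monthly usage sums only the debits whose timestamp falls in the month. The sketch below is an alternative to the answers above, not what they used. It runs on SQLite through Python's sqlite3 module as a stand-in for MySQL; the column name ts and the year 2012 (taken from the update at the top of the question) are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (company_id INTEGER, ts TEXT, credit INTEGER, debit INTEGER);
    INSERT INTO mytable VALUES
        (10, '2012-05-25', 100, 0),
        (11, '2012-05-25', 0, 54),
        (10, '2012-05-28', 0, 40),
        (12, '2012-06-01', 100, 0),
        (10, '2012-06-25', 150, 0),
        (10, '2012-06-25', 0, 25);
""")

rows = conn.execute("""
    SELECT company_id,
           -- All rows contribute to the balance, regardless of date.
           SUM(credit) - SUM(debit) AS balance,
           -- Only June debits contribute here; other rows add 0.
           SUM(CASE WHEN ts >= '2012-06-01' AND ts < '2012-07-01'
                    THEN debit ELSE 0 END) AS usage_june
    FROM mytable
    GROUP BY company_id
    ORDER BY balance DESC
""").fetchall()

print(rows)
```

Because the date condition sits inside the CASE instead of the WHERE clause, companies without June activity still appear, with a usage of 0.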
I have the following table with messages:
+---------+---------+------------+----------+
| msg_id | user_id | m_date | m_time |
+---------+---------+------------+----------+
| 1 | 1 | 2011-01-22 | 06:23:11 |
| 2 | 1 | 2011-01-23 | 16:17:03 |
| 3 | 1 | 2011-01-23 | 17:05:45 |
| 4 | 2 | 2011-01-22 | 23:58:13 |
| 5 | 2 | 2011-01-23 | 23:59:32 |
| 6 | 2 | 2011-01-24 | 21:02:41 |
| 7 | 3 | 2011-01-22 | 13:45:00 |
| 8 | 3 | 2011-01-23 | 13:22:34 |
| 9 | 3 | 2011-01-23 | 18:22:34 |
| 10 | 3 | 2011-01-24 | 02:22:22 |
| 11 | 3 | 2011-01-24 | 13:12:00 |
+---------+---------+------------+----------+
What I want is for each day, to see how many messages each user has sent BEFORE and AFTER 16:00:
SELECT
user_id,
m_date,
SUM(m_time <= '16:00') AS before16,
SUM(m_time > '16:00') AS after16
FROM messages
GROUP BY user_id, m_date
ORDER BY user_id, m_date ASC
This produces:
user_id m_date before16 after16
-------------------------------------
1 2011-01-22 1 0
1 2011-01-23 0 2
2 2011-01-22 0 1
2 2011-01-23 0 1
2 2011-01-24 0 1
3 2011-01-22 1 0
3 2011-01-23 1 1
3 2011-01-24 2 0
Because user 1 has written no messages on 2011-01-24, this date is not in the resultset. However, this is undesirable. I have a second table in my database, called "date_range":
+---------+------------+
| date_id | d_date |
+---------+------------+
| 1 | 2011-01-21 |
| 1 | 2011-01-22 |
| 1 | 2011-01-23 |
| 1 | 2011-01-24 |
+---------+------------+
I want to check the "messages" against this table. For each user, all these dates have to be in the resultset. As you can see, none of the users have written messages on 2011-01-21, and as said, user 1 has no messages on 2011-01-24. The desired output of the query would be:
user_id d_date before16 after16
-------------------------------------
1 2011-01-21 0 0
1 2011-01-22 1 0
1 2011-01-23 0 2
1 2011-01-24 0 0
2 2011-01-21 0 0
2 2011-01-22 0 1
2 2011-01-23 0 1
2 2011-01-24 0 1
3 2011-01-21 0 0
3 2011-01-22 1 0
3 2011-01-23 1 1
3 2011-01-24 2 0
How can I link the two tables so that the query result also holds rows with zero values for before16 and after16?
Edit: yes, I have a "users" table:
+---------+------------+
| user_id | user_date |
+---------+------------+
| 1 | foo |
| 2 | bar |
| 3 | foobar |
+---------+------------+
Test bed:
create table messages (msg_id integer, user_id integer, _date date, _time time);
create table date_range (date_id integer, _date date);
insert into messages values
(1,1,'2011-01-22','06:23:11'),
(2,1,'2011-01-23','16:17:03'),
(3,1,'2011-01-23','17:05:05');
insert into date_range values
(1, '2011-01-21'),
(1, '2011-01-22'),
(1, '2011-01-23'),
(1, '2011-01-24');
Query:
SELECT p._date, p.user_id,
coalesce(m.before16, 0) b16, coalesce(m.after16, 0) a16
FROM
(SELECT DISTINCT user_id, dr._date FROM messages m, date_range dr) p
LEFT JOIN
(SELECT user_id, _date,
SUM(_time <= '16:00') AS before16,
SUM(_time > '16:00') AS after16
FROM messages
GROUP BY user_id, _date
ORDER BY user_id, _date ASC) m
ON p.user_id = m.user_id AND p._date = m._date;
EDIT:
Your initial query is left as is; I hope it doesn't require any explanation;
SELECT DISTINCT user_id, dr._date FROM messages m, date_range dr returns the cartesian product (CROSS JOIN) of the two tables, which gives every date in the range for each user that has at least one message. As I'm interested in each pair only once, I use the DISTINCT clause. Try this query with and without it;
Then I use LEFT JOIN on two sub-selects.
This join means: first, INNER join is performed, i.e. all rows with matching fields in the ON condition are returned. Then, for each row in the left-side relation of the join that has no matches on the right side, return NULLs (thus the name, LEFT JOIN, i.e. left relation is always there and right is expected to have NULLs). This join will do what you expect — return user_id + date combinations even if there were no messages in the given date for a given user. Note that I use user_id + date sub-select first (on the left) and messages query second (on the right);
coalesce() is used to replace NULL with zero.
I hope this clarifies how this query works.
Give this a shot:
select u.user_id, u._date,
coalesce(sum(_time <= '16:00'), 0) as before16,
coalesce(sum(_time > '16:00'), 0) as after16
from (
select m.user_id, d._date
from messages m
cross join date_range d
group by m.user_id, d._date
) u
left join messages m on u.user_id=m.user_id
and u._date=m._date
group by u.user_id, u._date
The inner query is just building a set of all possible/desired user-date pairs. It would be more efficient to use a users table, but you didn't mention that you had one, so I won't assume. Otherwise, you just need the left join so that the non-joined records are not removed. The coalesce() turns the NULL that sum() returns for a pair with no messages into 0.
EDIT
--More detailed explanation: taking the query apart.
Start with the innermost query; the goal is to get a list of all desired dates for every user. Since there's a table of users and a table of dates it can look like this:
select distinct u.user_id, d.d_date
from users u
cross join date_range d
The key here is the cross join, taking every row in the users table and associating it with every row in the date_range table. The distinct keyword is really just a shorthand for a group by on all columns, and is here just in case there's duplicated data.
Note that there are several other methods of getting this same result set (like in my original query), but this is probably the simplest from both a logical and computational standpoint.
Really, the only other steps are to add the left join (associating all of the rows we got above to all available data, and not removing anything that doesn't have any data) and the group by and select components which are basically the same as you had before. So, putting everything together it looks like this:
select t.user_id, t.d_date,
coalesce(sum(m.m_time <= '16:00'), 0) as before16,
coalesce(sum(m.m_time > '16:00'), 0) as after16
from (
select distinct u.user_id, d.d_date
from users u
cross join date_range d
) t
left join messages m on t.user_id = m.user_id
and t.d_date = m.m_date
group by t.user_id, t.d_date
Based on some other comments/questions, note the explicit use of prefixes for all uses of all tables and sub-queries (which is pretty straight forward since we're not using any table more than once anymore): u for the users table, d for the date_range table, t for the sub-query containing the dates to use for each user, and m for the message table. This is probably where my first explanation fell a little short, since I used the message table twice, both times with the same prefix. It works there because of the context of both uses (one was in a sub-query), but it probably isn't the best practice.
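To check the cross join plus left join approach against the desired output from the question, here is a sketch that loads the question's sample data into an in-memory SQLite database through Python's sqlite3 module (a stand-in for MySQL). The COALESCE around each SUM is an addition of this sketch: SUM over a user/date pair with no messages yields NULL, and the desired output shows 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE messages (msg_id INTEGER, user_id INTEGER, m_date TEXT, m_time TEXT);
    CREATE TABLE date_range (date_id INTEGER, d_date TEXT);
    CREATE TABLE users (user_id INTEGER, user_date TEXT);
    INSERT INTO messages VALUES
        (1, 1, '2011-01-22', '06:23:11'), (2, 1, '2011-01-23', '16:17:03'),
        (3, 1, '2011-01-23', '17:05:45'), (4, 2, '2011-01-22', '23:58:13'),
        (5, 2, '2011-01-23', '23:59:32'), (6, 2, '2011-01-24', '21:02:41'),
        (7, 3, '2011-01-22', '13:45:00'), (8, 3, '2011-01-23', '13:22:34'),
        (9, 3, '2011-01-23', '18:22:34'), (10, 3, '2011-01-24', '02:22:22'),
        (11, 3, '2011-01-24', '13:12:00');
    INSERT INTO date_range VALUES (1, '2011-01-21'), (1, '2011-01-22'),
                                  (1, '2011-01-23'), (1, '2011-01-24');
    INSERT INTO users VALUES (1, 'foo'), (2, 'bar'), (3, 'foobar');
""")

rows = conn.execute("""
    SELECT t.user_id, t.d_date,
           COALESCE(SUM(m.m_time <= '16:00'), 0) AS before16,
           COALESCE(SUM(m.m_time >  '16:00'), 0) AS after16
    FROM (
        -- Every user paired with every date, whether or not a message exists.
        SELECT DISTINCT u.user_id, d.d_date
        FROM users AS u
        CROSS JOIN date_range AS d
    ) AS t
    -- LEFT JOIN keeps the pairs that have no matching message.
    LEFT JOIN messages AS m
      ON m.user_id = t.user_id
     AND m.m_date = t.d_date
    GROUP BY t.user_id, t.d_date
    ORDER BY t.user_id, t.d_date
""").fetchall()

for row in rows:
    print(row)
```

The result has one row per user and date (3 users x 4 dates), including the all-zero rows for 2011-01-21.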
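It is not neat, but if you have a users table, then maybe something like this. The zero rows for every user/date pair are appended with UNION ALL, and the outer SUM folds them into the real counts so that each pair appears only once:
SELECT
user_id,
_date,
SUM(before16) AS before16,
SUM(after16) AS after16
FROM (
SELECT user_id, _date, SUM(_time <= '16:00') AS before16, SUM(_time > '16:00') AS after16
FROM messages
GROUP BY user_id, _date
UNION ALL
SELECT user_id, _date, 0 AS before16, 0 AS after16
FROM users
CROSS JOIN date_range
) AS combined
GROUP BY user_id, _date
ORDER BY user_id, _date ASC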
chezy525's solution works great. I ported it to PostgreSQL and removed/renamed some aliases:
select users_and_dates.user_id, users_and_dates._date,
SUM(case when _time <= '16:00' then 1 else 0 end) as before16,
SUM(case when _time > '16:00' then 1 else 0 end) as after16
from (
select messages.user_id, date_range._date
from messages
cross join date_range
group by messages.user_id, date_range._date
) users_and_dates
left join messages on users_and_dates.user_id=messages.user_id
and users_and_dates._date=messages._date
group by users_and_dates.user_id, users_and_dates._date;
I ran it on my machine and it worked perfectly.