I am learning MySQL and found a project on e-commerce and customer behaviour that I want to follow along with.
However, when calculating the number of unique customers retained on the second day, I used a different approach from the original author and got a different result.
The author's code is below:
select count(distinct user_id) as first_day_customer_num from userbehavior
where date = '2017-11-25';-- 359 unique customers counted and retained on the first day
select count(distinct user_id) as second_day_customer_num from userbehavior
where date = '2017-11-26' and user_id in (SELECT user_id FROM userbehavior
WHERE date = '2017-11-25');-- 295 unique customers counted and retained on the second day
I used BETWEEN on the date instead. Here is my code to calculate the number of unique customers retained on the second day:
select count(distinct user_id) as trial from userbehavior
where date between '2017-11-25' and '2017-11-26'; -- 450 unique customers counted
Could I ask why our results are different and which part I got wrong?
Thank you so much for your help and support, really appreciate it.
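A minimal sketch of why the two queries disagree, run against SQLite from Python rather than MySQL. The six rows and their user_ids are made up for illustration. The author's query counts users seen on both days (an intersection), while the BETWEEN query counts users seen on either day (a union), so it also includes day-one users who never returned and users who first appeared on day two.

```python
import sqlite3

# Hypothetical miniature of the userbehavior table; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE userbehavior (user_id INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO userbehavior VALUES (?, ?)",
    [
        (1, "2017-11-25"), (2, "2017-11-25"), (3, "2017-11-25"),  # day-one users
        (2, "2017-11-26"), (3, "2017-11-26"),                      # came back on day two
        (4, "2017-11-26"),                                         # first seen on day two
    ],
)

def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]

first_day = scalar(
    "SELECT COUNT(DISTINCT user_id) FROM userbehavior WHERE date = '2017-11-25'"
)
second_day = scalar(
    "SELECT COUNT(DISTINCT user_id) FROM userbehavior WHERE date = '2017-11-26'"
)
# The author's query: active on day two AND already seen on day one.
retained = scalar("""
    SELECT COUNT(DISTINCT user_id) FROM userbehavior
    WHERE date = '2017-11-26'
      AND user_id IN (SELECT user_id FROM userbehavior WHERE date = '2017-11-25')
""")
# The BETWEEN query: active on day one OR day two.
either_day = scalar("""
    SELECT COUNT(DISTINCT user_id) FROM userbehavior
    WHERE date BETWEEN '2017-11-25' AND '2017-11-26'
""")

print(first_day, second_day, retained, either_day)
```

In this toy data the counts satisfy either_day = first_day + second_day - retained. If the figures in the question come from the same table, the same identity would put the number of distinct day-two users at 450 - 359 + 295 = 386.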
Related
I have the following two tables:
movie_sales (provided daily)
movie_id
date
revenue
movie_rank (provided every few days or weeks)
movie_id
date
rank
The tricky thing is that every day I have data for sales, but only data for ranks once every few days. Here is an example of sample data:
`movie_sales`
- titanic (ID), 2014-06-01 (date), 4.99 (revenue)
- titanic (ID), 2014-06-02 (date), 5.99 (revenue)
`movie_rank`
- titanic (ID), 2014-05-14 (date), 905 (rank)
- titanic (ID), 2014-07-01 (date), 927 (rank)
And, because the movie_rank.date of 2014-05-14 is closer to the two sales dates than 2014-07-01 is, the output should be:
id date revenue closest_rank
titanic 2014-06-01 4.99 905
titanic 2014-06-02 5.99 905
The following query works to get the results by getting the min date difference in the sub-select:
SELECT
id,
date,
revenue,
(SELECT rank from movie_rank where id=s.id ORDER BY ABS(DATEDIFF(date, s.date)) ASC LIMIT 1)
FROM
movie_sales s
But I'm afraid that this would have terrible performance, as it will literally be doing millions of subselects on millions of rows. What would be a better way to do this, or is there really no proper way, since an index cannot be used with DATEDIFF?
Unfortunately, you are right. The movie rank table must be searched for each movie sale, and of all matching movie rows the closest one must be picked.
With an index on movie_rank(id) the DBMS finds the movie rows quickly, but an index on movie_rank(id, date) would be better, because the date could be read from the index and only the one best match would be read from the table.
But you also say that there are new ranks every few days. If it is guaranteed that a rank exists within a certain range, e.g. for each date there will be at least one rank in the twenty days before and at least one rank in the twenty days after, you can limit the search accordingly. (The index on movie_rank(id, date) would be essential for this, though.)
SELECT
id,
date,
revenue,
(
select r.rank
from movie_rank r
where r.id = s.id
and r.date between s.date - interval 20 day
and s.date + interval 20 day
order by abs(datediff(r.date, s.date)) asc
limit 1
)
FROM movie_sales s;
This is difficult to get quick with SQL. In a programming language I would choose this algorithm:
1. Sort the two tables by date and point to the first rows.
2. Move the rank pointer forward until we match the sales date or are beyond it. (If we aren't there already.)
3. Compare the sales date with the rank date we are pointing at and with the rank date of the previous row. Take the closer one.
4. Move the sales pointer one row forward.
5. Go to 2.
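The steps above can be sketched in Python roughly as follows, for a single movie. The function name closest_ranks and the tuple layout are my own choices for illustration. The sketch assumes at least one rank row exists and, on a tie, prefers the earlier rank.

```python
from datetime import date

def closest_ranks(sales, ranks):
    """Match each sale to the rank whose date is closest to the sale date.

    sales: list of (sale_date, revenue) for one movie
    ranks: non-empty list of (rank_date, rank) for the same movie
    Returns a list of (sale_date, revenue, closest_rank).
    """
    sales = sorted(sales)   # step 1: sort both inputs by date
    ranks = sorted(ranks)
    result = []
    i = 0                   # rank pointer; it only ever moves forward
    for sale_date, revenue in sales:
        # Step 2: advance until the rank date is on or after the sale date
        # (or we have run out of rank rows).
        while i < len(ranks) - 1 and ranks[i][0] < sale_date:
            i += 1
        best = ranks[i]
        # Step 3: compare with the previous rank row and keep the closer one.
        if i > 0:
            previous = ranks[i - 1]
            if abs((sale_date - previous[0]).days) <= abs((best[0] - sale_date).days):
                best = previous
        result.append((sale_date, revenue, best[1]))
        # Steps 4 and 5: the for loop moves on to the next sale.
    return result

# Sample data from the question.
sales = [(date(2014, 6, 1), 4.99), (date(2014, 6, 2), 5.99)]
ranks = [(date(2014, 5, 14), 905), (date(2014, 7, 1), 927)]
matched = closest_ranks(sales, ranks)
for row in matched:
    print(row)
```

Because neither pointer ever moves backwards, the matching itself is linear in the number of rows once both lists are sorted.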
With this algorithm we would already be in about the position we want to be. Let's see, if we can do the same with SQL. Iterations are done with recursive queries in SQL. These are available in MySQL as of version 8.0.
We start with sorting the rows, i.e. giving them numbers. Then we iterate through both data sets.
with recursive
sales as
(
select *, row_number() over (partition by movie_id order by date) as rn
from movie_sales
),
ranks as
(
select *, row_number() over (partition by movie_id order by date) as rn
from movie_rank
),
cte (movie_id, revenue, srn, rrn, sdate, rdate, rrank, closest_rank) as
(
select
movie_id, s.revenue, s.rn, r.rn, s.date, r.date, r.ranking,
case when s.date <= r.date then r.ranking end
from (select * from sales where rn = 1) s
join (select * from ranks where rn = 1) r using (movie_id)
union all
select
cte.movie_id,
cte.revenue,
coalesce(s.rn, cte.srn),
coalesce(r.rn, cte.rrn),
coalesce(s.date, cte.sdate),
coalesce(r.date, cte.rdate),
coalesce(r.ranking, cte.rrank),
case when coalesce(r.date, cte.rdate) >= coalesce(s.date, cte.sdate) then
case when abs(datediff(coalesce(r.date, cte.rdate), coalesce(s.date, cte.sdate))) <
abs(datediff(cte.rdate, coalesce(s.date, cte.sdate)))
then coalesce(r.ranking, cte.rrank)
else cte.rrank
end
end
from cte
left join sales s on s.movie_id = cte.movie_id and s.rn = cte.srn + 1 and cte.closest_rank is not null
left join ranks r on r.movie_id = cte.movie_id and r.rn = cte.rrn + 1 and cte.rdate < cte.sdate
where s.movie_id is not null or r.movie_id is not null
-- where cte.closest_rank is null
)
select
movie_id,
sdate,
revenue,
closest_rank
from cte
where closest_rank is not null;
(BTW: I named the column ranking, because rank is a reserved word in SQL.)
Demo: https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=e994cb56798efabc8f7249fd8320e1cf
This is probably still slow. The reason is that there are no pointers to a row in SQL. If we want to go from row #1 to row #2, we must search for that row, while in a programming language we would really just move the pointer one step forward. If the tables had an ID, we could build a chain (next_row_id) instead of using row numbers. That could speed this process up. But, as you have probably noticed, this is not an algorithm made for SQL.
Another approach... Avoid the problem by cleansing the data.
Make sure the rank is available for every day. When a new date comes in, find the previous rank, then fill in all the rows for the intervening days.
(This will take some initial effort to 'fix' all the previous missing dates. After that, it is a small effort when a new list of ranks comes in.)
The "report" would be a simple JOIN on the date. You would probably need a 2-column INDEX(movie_id, date) or something like that.
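A rough Python sketch of the fill-in step, assuming the ranks for one movie are held in a dict keyed by date. The function name fill_daily_ranks is hypothetical. Note that it carries the previous rank forward, as described above, rather than picking the closest one.

```python
from datetime import date, timedelta

def fill_daily_ranks(ranks, last_day):
    """Return a rank for every day from the first known rank date to last_day.

    ranks: non-empty dict mapping date -> rank, with gaps between dates
    Days without a rank of their own inherit the most recent earlier rank.
    """
    filled = {}
    current = None
    day = min(ranks)
    while day <= last_day:
        current = ranks.get(day, current)  # a new rank replaces the carried one
        filled[day] = current
        day += timedelta(days=1)
    return filled

# Sample rank rows from the question.
sparse = {date(2014, 5, 14): 905, date(2014, 7, 1): 927}
daily = fill_daily_ranks(sparse, date(2014, 7, 1))
print(daily[date(2014, 6, 1)], daily[date(2014, 7, 1)], len(daily))
```

Once every day has a row, the report no longer needs a nearest-date search at all.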
The ultimate solution would be not to calculate all the ranks every time, but to store them (in a new column, or even in a new table if you don't want to change existing tables).
Each time you update, you could look for sales data without a rank and calculate it only for those rows.
With the above approach you always get the last available rank BEFORE the sales date (e.g. if you have rank data 14 days before and 1 day after, the one before would still be used).
If you strictly need the ranking closest in time, then you also need to run the UPDATE for newly arrived ranking info. I believe it would still be more efficient in the long run.
Apologies in advance for the vagueness of the title. This is an issue that is stumping me and I struggled to get any more specific.
First of all, to help visualise my problem I've uploaded a photo of my database to http://imgur.com/a/rTyn8.
Basically, I've been adding up payments in my database and have run into a complex (contextually to my understanding of MySQL, anyway, which is mediocre at best) problem.
I want to calculate the number of times any given customer (customer_id) has a job_id payment of both 17 & 12 in one day. If they do, I then want to calculate the added cost of them. However, I'd like to run this query throughout the whole database between 2 specific dates (eg. 2016-01-01 -- 2016-05-06) and generate the total income during this period.
In the picture I link to above, the customer with a customer_id of 1658 has two payments, one with a job_id of 12 and one with a job_id of 17. Therefore, I would like to add the cost of both of these (6.00 + 19.80) together, along with those of anyone else who meets this criterion, and come to a total figure.
Just to clarify, the customer (with a customer_id of 1913) below the rows I refer to would also fall under this category.
I've tried my best at getting something together, but admittedly I'm completely lost.
Thanks in advance,
Liam
Join the table to itself, once for each job type:
select
count(*) quantity,
sum(a.cost + b.cost) total
from mytable a
join mytable b on b.customer_id = a.customer_id
and a.date = b.date
where a.date between '2016-01-01' and '2016-05-06'
and a.job_id = 17
and b.job_id = 12
If you want a breakdown by customer_id, add a.customer_id to the selected columns and add group by a.customer_id (the column must be qualified, because both a and b have a customer_id).
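Here is a small sketch of the self-join, run against SQLite from Python. Only the two costs for customer 1658 come from the question. The dates, the costs for customer 1913 and the extra customers 2000 and 2001 are invented to exercise the filters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (customer_id INTEGER, job_id INTEGER, cost REAL, date TEXT)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?)",
    [
        (1658, 12, 6.00, "2016-03-01"),
        (1658, 17, 19.80, "2016-03-01"),
        (1913, 12, 5.00, "2016-03-02"),    # made-up costs for 1913
        (1913, 17, 10.00, "2016-03-02"),
        (2000, 17, 19.80, "2016-03-02"),   # job 17 only: must not count
        (2001, 12, 6.00, "2016-06-01"),    # both jobs, but outside the date range
        (2001, 17, 19.80, "2016-06-01"),
    ],
)

# Overall figures: one joined row per (job 17, job 12) pair on the same day.
quantity, total = conn.execute("""
    SELECT COUNT(*), SUM(a.cost + b.cost)
    FROM mytable a
    JOIN mytable b ON b.customer_id = a.customer_id AND a.date = b.date
    WHERE a.date BETWEEN '2016-01-01' AND '2016-05-06'
      AND a.job_id = 17
      AND b.job_id = 12
""").fetchone()

# Breakdown per customer.
per_customer = conn.execute("""
    SELECT a.customer_id, COUNT(*), SUM(a.cost + b.cost)
    FROM mytable a
    JOIN mytable b ON b.customer_id = a.customer_id AND a.date = b.date
    WHERE a.date BETWEEN '2016-01-01' AND '2016-05-06'
      AND a.job_id = 17
      AND b.job_id = 12
    GROUP BY a.customer_id
    ORDER BY a.customer_id
""").fetchall()

print(quantity, round(total, 2))
print(per_customer)
```

One thing to keep in mind: if a customer can have several payments with the same job_id on one day, the join produces one row per pair, so the count and the total grow accordingly.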
I am developing a php/mysql database.
I have a table called ‘actions’ which (amongst others) contains fields hrs, mins, actiondate, invoiceid and staffid.
For any particular actiondate there could be any number of actions carried out by various staff who would enter their time as hrs and mins.
What I need to do is produce a table which for each date and for a specific member of staff and invoice, adds up all of the hrs and mins for each date as a decimal, rounds it up to the nearest quarter and displays that result. I also need to be able to add up all of those results and display that total.
For example, if on March 1st the person with staffid=23 had carried out 4 actions for invoice 121 lasting 1h2m, 23m, 10m and 20m, the total for that day would be 62+23+10+20 = 115m = 115/60 = 1.92, which would be rounded up to 2.00.
I can get each day’s total (maybe not very elegantly) and display it against the date using the code below
SELECT actions.`actiondate`,
(FORMAT((((CEIL((((60*SUM(hrs))+SUM(mins))/60)*4))/4)),2)) AS dayfeeqtr
FROM actions
WHERE staff.staffid='23'
AND invoiceid='121'
GROUP BY actions.`actiondate`
However, what I can’t work out, is how can I add up all of these rounded up results for that invoice and that member of staff.
Can anyone help please?
If I understand correctly, you can use a subquery:
SELECT SUM(dayfeeqtr) AS totalfeeqtr
FROM (SELECT a.`actiondate`,
FORMAT((((CEIL((((60*SUM(hrs))+SUM(mins))/60)*4))/4)), 2) AS dayfeeqtr
FROM actions a
WHERE a.staffid = '23' AND a.invoiceid = '121'
GROUP BY a.`actiondate`
) d;
I do note that your query is not correct -- for instance, there is a reference to staff, which is not in a from clause. However, you say that this is working, so I assume the errors are a transcription problem.
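To make the arithmetic concrete, here is a small Python sketch of the same calculation: add up the minutes per day, round each day up to the next quarter hour, then sum the rounded days. The first day uses the durations from the question. The second day is made up.

```python
def round_up_to_quarter_hours(total_minutes):
    """Convert whole minutes to hours, rounded up to the next 0.25."""
    quarters = -(-total_minutes // 15)  # ceiling division: 15-minute blocks
    return quarters / 4

# Minutes per action, grouped by day, for one staff member and one invoice.
actions_by_day = {
    "March 1": [62, 23, 10, 20],  # 115 minutes, from the question
    "March 2": [50, 15],          # 65 minutes, hypothetical
}

day_fees = {
    day: round_up_to_quarter_hours(sum(minutes))
    for day, minutes in actions_by_day.items()
}
total_fee = sum(day_fees.values())

print(day_fees)
print(total_fee)
```

The order matters: rounding each day first and then summing gives 2.00 + 1.25 = 3.25 here, whereas summing all 180 minutes first and rounding once would give 3.00. That is why the rounding has to happen inside the subquery.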
I have this fact table. Using it, I would like to group by year and list the total number of patients that have PatientType_id = 1101.
Example:
2012 5
2012 8
The Date_DateKey is actually the date 2012-03-14. I've managed to list the total patients with typeID 1101 for a single year, but I don't know how to list all the years. Could you give me some hints please?
And here's the Date dimension
Normally, a key column in a fact table would reference another table. So, you should have a date/calendar table somewhere with information like the year. That would be the proper way to get this information.
I discourage you from parsing key values in general. In this case, with the information you have provided, it seems to be the only solution:
select floor(date_datekey / 10000) as year, count(distinct patient_id)
from table t
where PatientType_id = 1101
group by floor(date_datekey / 10000)
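A small sketch of the same idea, run against SQLite from Python. The table name patient_fact and the rows are invented, and the key is assumed to be an integer of the form YYYYMMDD. SQLite's `/` on integers already truncates, so no floor is needed there. In MySQL `/` returns a decimal, which is why the query above wraps it in floor (the integer-division operator DIV would also work).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient_fact ("
    "patient_id INTEGER, PatientType_id INTEGER, Date_DateKey INTEGER)"
)
conn.executemany(
    "INSERT INTO patient_fact VALUES (?, ?, ?)",
    [
        (1, 1101, 20120314),
        (1, 1101, 20120401),  # same patient again in 2012: counted once
        (2, 1101, 20120520),
        (3, 1101, 20130102),
        (4, 2202, 20130105),  # different patient type: filtered out
    ],
)

patients_per_year = conn.execute("""
    SELECT Date_DateKey / 10000 AS year, COUNT(DISTINCT patient_id)
    FROM patient_fact
    WHERE PatientType_id = 1101
    GROUP BY Date_DateKey / 10000
    ORDER BY year
""").fetchall()

print(patients_per_year)
```

COUNT(DISTINCT patient_id) counts each patient once per year. A plain COUNT(patient_id) would count rows instead and report 3 for 2012 in this sample.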
Try this:
SELECT LEFT(Date_DateKey,4) as Year, COUNT(patient_id) FROM Table WHERE PatientType_id = 1101 GROUP BY LEFT(Date_DateKey,4)
What is the best way to think about the Group By function in MySQL?
I am writing a MySQL query to pull data through an ODBC connection in a pivot table in Excel so that users can easily access the data.
For example, I have:
Select
statistic_date,
week(statistic_date,4),
year(statistic_date),
Emp_ID,
count(distinct Emp_ID),
Site,
Cost_Center
I'm trying to count the number of unique employees we have by site by week. The problem I'm running into is around year end: the calendar years don't always match up, so it is important to have them by date so that I can manually filter down to the correct dates using a pivot table (2013/2014 had a week where we had to add week 53 + week 1).
I'm experimenting by using different group by statements but I'm not sure how the order matters and what changes when I switch them around.
i.e.
Group by week(statistic_date,4), Site, Cost_Center, Emp_ID
vs
Group by Site, Cost_Center, week(statistic_date,4), Emp_ID
Other things to note:
-Employees can work any number of days. Some are working 4 x 10's, others 5 x 8's with possibly a 6th day if they sign up for OT. If I sum the counts by week, I get anywhere between 3-7 per Emp_ID. I'm hoping to get 1 for the week.
-There are different pay code per employee so the distinct count helps when we are looking by day (VTO = Voluntary Time Off, OT = Over Time, LOA = Leave of Absence, etc). The distinct count will show me 1, where often times I will have 2-3 for the same emp in the same day (hits 40 hours and starts accruing OT then takes VTO or uses personal time in the same day).
I'm starting with a query I wrote to understand our paid hours by week. I'm trying to adapt it for this application. Actual code is below:
SELECT
dkh.STATISTIC_DATE AS 'Date'
,week(dkh.STATISTIC_DATE,4) as 'Week'
,month(dkh.STATISTIC_DATE) as 'Month'
,year(dkh.STATISTIC_DATE) as 'Year'
,dkh.SITE AS 'Site ID Short'
,aep.LOC_DESCR as 'Site Name'
,dkh.EMPLOYEE_ID AS 'Employee ID'
,count(distinct dkh.EMPLOYEE_ID) AS 'Distinct Employee ID'
,aep.NAME AS 'Employee Name'
,aep.BUSINESS_TITLE AS 'Business_Title'
,aep.SPRVSR_NAME AS 'Manager'
,SUBSTR(aep.DEPTID,1,4) AS 'Cost_Center'
,dkh.PAY_CODE
,dkh.PAY_CODE_SHORT
,dkh.HOURS
FROM metrics.DAT_KRONOS_HOURS dkh
JOIN metrics.EMPLOYEES_PUBLIC aep
ON aep.SNAPSHOT_DATE = SUBDATE(dkh.STATISTIC_DATE, DAYOFWEEK(dkh.STATISTIC_DATE) + 1)
AND aep.EMPLID = dkh.EMPLOYEE_ID
WHERE dkh.STATISTIC_DATE BETWEEN adddate(now(), interval -1 year) AND DATE(now())
group by dkh.SITE, SUBSTR(aep.DEPTID,1,4), week(dkh.STATISTIC_DATE,4), dkh.STATISTIC_DATE, dkh.EMPLOYEE_ID
The order you use in group by doesn't matter. Each unique combination of the values gets a group of its own. Selecting columns you don't group by gives you somewhat arbitrary results; you'd probably want to use some aggregation function on them, such as SUM to get the group total.
Grouping by values you derive from other values that you already use in group by, like below, isn't very useful.
week(dkh.STATISTIC_DATE,4), dkh.STATISTIC_DATE
If two rows have different weeks, they'll also have different dates, right?
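A small sketch of both points, run against SQLite from Python. The table hours and its rows are made up, and the week number is stored as a plain column so the example does not depend on MySQL's week() function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hours (statistic_date TEXT, week INTEGER, site TEXT, emp_id INTEGER)"
)
conn.executemany(
    "INSERT INTO hours VALUES (?, ?, ?, ?)",
    [
        ("2014-01-06", 2, "A", 1),  # employee 1 works three days in week 2
        ("2014-01-07", 2, "A", 1),
        ("2014-01-08", 2, "A", 1),
        ("2014-01-06", 2, "A", 2),
        ("2014-01-06", 2, "B", 3),
        ("2014-01-13", 3, "A", 1),
    ],
)

def rows(sql):
    """Run a query and return its rows sorted, so results can be compared."""
    return sorted(conn.execute(sql).fetchall())

# Same grouping columns in a different order: the groups are identical.
week_then_site = rows(
    "SELECT week, site, COUNT(DISTINCT emp_id) FROM hours GROUP BY week, site"
)
site_then_week = rows(
    "SELECT week, site, COUNT(DISTINCT emp_id) FROM hours GROUP BY site, week"
)

# Adding the date and the employee to GROUP BY makes one group per employee
# per day, so the distinct count inside every group is 1.
per_day = rows(
    "SELECT week, site, emp_id, COUNT(DISTINCT emp_id) FROM hours "
    "GROUP BY week, site, statistic_date, emp_id"
)

print(week_then_site)
print(week_then_site == site_then_week)
print(len(per_day))
```

Summing the per_day counts for employee 1 in week 2 gives 3, one for each day worked, which mirrors the 3-7 per Emp_ID described in the question. Grouping only by week and site yields one distinct count per site per week.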