Tricky Rails3/mysql query - mysql

In Rails 3 (feel free to use the meta_where gem in your query), I have a really tricky query that I have been banging my head against:
Suppose I have two models, customers and purchases, where a customer has many purchases. Let's define a customer with at least 2 purchases as a "repeat_customer". I need to find the total number of repeat_customers as of each day for the past 3 months, something like:
Date      TotalRepeatCustomerCount
1/1/11    10    (10 repeat customers by the end of 1/1/11)
1/2/11    15    (5 more customers gained "repeat" status on this date)
1/3/11    16    (1 more customer gained "repeat" status on this date)
...
3/30/11   150
3/31/11   160
Basically I need to group customer count based on the date of creation of their second purchase, since that is when they "gain repeat status".
Certainly this can be achieved in ruby, something like:
Customer.includes(:purchases).all.
  select { |x| x.purchases.count >= 2 }.
  group_by { |x| x.purchases.second.created_at.to_date }.
  map { |date, customers| [date, customers.count] }
However, the above code will fire queries along the lines of Customer.all and Purchase.all, then do a bunch of calculations in Ruby. I would much prefer doing the selection, grouping and calculations in MySQL, since that is not only much faster but also reduces the bandwidth used by the database. On large databases, the code above is basically useless.
I have been trying for a while to conjure up the query in Rails/ActiveRecord, but have had no luck even with the nice meta_where gem. If I have to, I will accept a solution as a pure MySQL query as well.
Edited: I would cache it (or add a "repeat" field to customers), though only for this simplified problem. The criteria for a repeat customer can be changed by the client at any point (2 purchases, 3 purchases, 4 purchases etc.), so unfortunately I do have to calculate it on the spot.

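SELECT p_date, COUNT(a.id)  -- the derived table is aliased a, so customers.id is not visible here
FROM (
    SELECT p_date - INTERVAL 1 DAY AS p_date, customers.id
    FROM customers
    -- explicit join: NATURAL JOIN would match on every shared column name (e.g. id)
    JOIN purchases ON purchases.customer_id = customers.id
    -- purchase_date is assumed here; the question's schema calls it created_at
    JOIN (SELECT DISTINCT DATE(purchase_date) AS p_date FROM purchases) p_dates
    WHERE purchases.purchase_date < p_date
    GROUP BY p_date, customers.id
    HAVING COUNT(purchases.id) >= 2
) a
GROUP BY p_date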
I didn't test this in the slightest, so I hope it works. Also, I hope I understood what you are trying to accomplish.
But please note that you should not run this on every request, because it will be too slow. Since the data for a day never changes once that day has passed, just cache the result for each day.
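For what it's worth, here is a minimal sketch of a different way to get the same numbers: rank each customer's purchases with a correlated subquery and keep only the purchase that grants "repeat" status. It is written in Python against an in-memory SQLite database purely so it can be run as-is; the purchases(id, customer_id, created_at) schema, the sample rows and the repeat_customers_by_day helper are assumptions for illustration, not part of the original question or answer.

```python
import sqlite3

# Hypothetical schema following Rails conventions: purchases(id, customer_id, created_at).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (id INTEGER PRIMARY KEY, customer_id INTEGER, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO purchases (customer_id, created_at) VALUES (?, ?)",
    [
        (1, "2011-01-01 09:00:00"),
        (1, "2011-01-01 17:00:00"),  # customer 1 becomes "repeat" on 1/1
        (2, "2011-01-01 10:00:00"),
        (2, "2011-01-02 10:00:00"),  # customer 2 becomes "repeat" on 1/2
        (3, "2011-01-02 11:00:00"),  # a single purchase: never "repeat"
        (1, "2011-01-03 12:00:00"),  # third purchase, irrelevant when threshold = 2
    ],
)


def repeat_customers_by_day(conn, threshold=2):
    """Return [(day, cumulative_repeat_customers)] ordered by day."""
    # The correlated subquery counts a customer's purchases up to and including
    # the current one, so rows where that count equals `threshold` are exactly
    # the purchases that grant "repeat" status.
    rows = conn.execute(
        """
        SELECT date(p.created_at) AS day, COUNT(*) AS gained
        FROM purchases p
        WHERE (SELECT COUNT(*)
               FROM purchases q
               WHERE q.customer_id = p.customer_id
                 AND (q.created_at < p.created_at
                      OR (q.created_at = p.created_at AND q.id <= p.id))
              ) = ?
        GROUP BY day
        ORDER BY day
        """,
        (threshold,),
    ).fetchall()
    # Turn the per-day gains into a running total.
    total, result = 0, []
    for day, gained in rows:
        total += gained
        result.append((day, total))
    return result


print(repeat_customers_by_day(conn))
```

Days on which nobody gained repeat status do not appear in the result, so they would have to be filled in separately. Because the threshold is a parameter, the client can switch between 2, 3 or 4 purchases without any schema change.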

Related

How to count days between 2 dates except holiday and weekend?

I started an HR management project and I want to count the days between 2 dates without counting holidays and weekends, so that HR can count employees' days off.
Here's the case: I want to count the days between 2018-02-14 and 2018-02-20, where there is an office holiday on 2018-02-16. The result should be 3 days.
I have already created a table called tbl_holiday where I put all the weekends and holidays of one year.
I found this post, and I tried it on my MariaDB
Here's my query:
SELECT 5 * (DATEDIFF('2018-02-20', '2018-02-14') DIV 7) +
MID('0123444401233334012222340111123400012345001234550', 7 *
WEEKDAY('2018-02-14') + WEEKDAY('2018-02-20') + 1, 1) -
(SELECT COUNT(dates) FROM tbl_holiday WHERE dates NOT IN (SELECT dates FROM tbl_holiday)) as Days
The query works, but the result is 4 days, not 3. That means the query only excludes the weekends but not the holiday.
What is wrong with my query? Am I missing something? Thank you for helping me.
@RichardDoe, from the question comments.
In a reasonable implementation of a date table, you create a list of all days (covering a sufficient range to cope with any query you may run against it - 15 years each way from today is probably a useful minimum), and alongside each day you store a variety of derived attributes.
I wrote a Q&A recently with basic tools that would get you started in SQL Server: https://stackoverflow.com/a/48611348/9129668
Unfortunately I don't have a MySQL environment or intimate familiarity with it to allow me to write or adapt queries off the top of my head (as I'm doing here), but I hope this will illustrate the structure of a solution for you in SQL Server syntax.
In terms of the answer I link to (which generates a date table on the fly) and extending it by adding in your holiday table (and making some inferences about how you've defined your holiday table), and noting that a working day is any day Mon-Fri that isn't a holiday, you'd write a query like so to get the number of working days between any two dates:
WITH
dynamic_date_table AS
(
    SELECT *
    FROM generate_series_datetime2('2000-01-01','2030-12-31',1)
    CROSS APPLY datetime2_params_fxn(datetime2_value)
)
,date_table_ext1 AS
(
    SELECT
        ddt.*
        ,IIF(hol.dates IS NOT NULL, 1, 0) AS is_company_holiday
    FROM
        dynamic_date_table AS ddt
    LEFT JOIN
        tbl_holiday AS hol
        ON (hol.dates = ddt.datetime2_value)
)
,date_table_ext2 AS
(
    SELECT
        *
        ,IIF(is_weekend = 1 OR is_company_holiday = 1, 0, 1) AS is_company_work_day
    FROM date_table_ext1
)
SELECT
    COUNT(datetime2_value)
FROM
    date_table_ext2
WHERE
    (datetime2_value BETWEEN '2018-02-14' AND '2018-02-20')
    AND
    (is_company_work_day = 1)
Obviously, the idea for a well-factored solution is that these intermediate calculations (being general in nature to the entire company) get rolled into the date_params_fxn, so that any query run against the database gains access to the pre-defined list of company workdays. Queries that are run against it then start to resemble plain English (rather than the approach you linked to and adapted in your question, which is ingenious but far from clear).
If you want top performance (which will be relevant if you are hitting these calculations heavily) then you define appropriate parameters, save the lot into a stored date table, and index that table appropriately. This way, your query would become as simple as the final part of the query here, but referencing the stored date table instead of the with-block.
The sequentially-numbered workdays I referred to in my comment on your question, are another step again for the efficiency and indexability of certain types of queries against a date table, but I won't complicate this answer any further for now. If any further clarification is required, please feel free to ask.
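As a rough illustration of the stored date table idea, here is a small sketch in Python with an in-memory SQLite database, used only so that it runs without a server. The date_table name, its columns and the one-year range are assumptions for the example, and tbl_holiday is assumed to hold holidays only.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl_holiday (dates TEXT PRIMARY KEY)")
conn.execute("INSERT INTO tbl_holiday VALUES ('2018-02-16')")  # holidays only (assumption)

# Stored date table: one row per day with precomputed attributes.
conn.execute(
    "CREATE TABLE date_table ("
    "day TEXT PRIMARY KEY, is_weekend INTEGER, is_company_work_day INTEGER)"
)
holidays = {row[0] for row in conn.execute("SELECT dates FROM tbl_holiday")}
day = date(2018, 1, 1)
while day <= date(2018, 12, 31):
    is_weekend = int(day.weekday() >= 5)  # Saturday = 5, Sunday = 6
    is_work_day = int(not is_weekend and day.isoformat() not in holidays)
    conn.execute(
        "INSERT INTO date_table VALUES (?, ?, ?)",
        (day.isoformat(), is_weekend, is_work_day),
    )
    day += timedelta(days=1)

# The final query is now as simple as the last part of the WITH-block version.
work_days = conn.execute(
    "SELECT COUNT(*) FROM date_table "
    "WHERE day BETWEEN ? AND ? AND is_company_work_day = 1",
    ("2018-02-14", "2018-02-20"),
).fetchone()[0]
print(work_days)
```

Note that BETWEEN includes both endpoints, so for the sample range this counts the 14th, 15th, 19th and 20th. If the first (or last) day should not count, as the expected result of 3 in the question suggests, use a half-open range instead.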
I found the answer to this problem.
It turns out I just need simple arithmetic:
SELECT (SELECT DATEDIFF('2018-02-20', '2018-02-14')) - (SELECT COUNT(id) FROM tbl_holiday WHERE dates BETWEEN '2018-02-14' AND '2018-02-20');
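To check the arithmetic, here is a small runnable sketch in Python with SQLite. SQLite has no DATEDIFF, so the difference of two julianday() values stands in for it (an assumption of this sketch). The three rows in tbl_holiday are the holiday plus the weekend inside the sample range, since the table in the question stores weekends as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl_holiday (id INTEGER PRIMARY KEY, dates TEXT)")
# tbl_holiday holds weekends as well as holidays, as described in the question.
conn.executemany(
    "INSERT INTO tbl_holiday (dates) VALUES (?)",
    [("2018-02-16",), ("2018-02-17",), ("2018-02-18",)],
)

days = conn.execute(
    """
    SELECT CAST(julianday(:end) - julianday(:start) AS INTEGER)
         - (SELECT COUNT(id) FROM tbl_holiday WHERE dates BETWEEN :start AND :end)
    """,
    {"start": "2018-02-14", "end": "2018-02-20"},
).fetchone()[0]
print(days)
```

The date difference is 6 and three rows of tbl_holiday fall inside the range, which gives the expected 3. This only works because tbl_holiday contains the weekends too; with a holidays-only table the weekend days would not be subtracted.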

How to limit result set to only the latest instance in the JOIN

I'm creating a sales pipeline report, where I capture the wins and losses for each sales person each week.
The report works for the most part, except for one corner case that bugs me. It wouldn't typically occur, but if a sales person moves an opportunity to a win status, then back to a loss status, then again to a win status, it will count as 2 wins. I am looking for some way to get only the latest row from the audit (detail) table in which (a) the date is within the last week and (b) the after_value is a win or loss value.
I have tried doing this as much as possible in the join, like so:
FROM
opportunities o ON ao.opportunity_id=o.id
LEFT JOIN opportunities_audit oa ON o.id=oa.parent_id
AND after_value_string IN ('Loss', 'Win')
AND date_created > date_sub(now(), INTERVAL 1 WEEK)
INNER JOIN sweet.users u ON o.assigned_user_id=u.id
but I haven't found a way to use something like MAX(id) in the join. I also tried a MAX(id) in the SELECT, but I have several sum(IF) statements, and I didn't think it made sense to have to do it for every sum(IF) - plus I couldn't figure out how to make it work for just one of them anyway.
I keep going to MAX, or maybe a subquery to join the table to itself and get the MAX(id) that way, but I just haven't figured out where to put the subquery, since I don't want every SELECT to use it. And if that is in fact even the best solution. Oh, AND, the id in these tables look like hash values, so I don't know if MAX would work anyway. Le sigh.
Here's just part of the SELECT, in case it helps:
, sum(IF(o.sales_stage = 'Win'
AND (o.date_modified > date_sub(now(), INTERVAL 1 WEEK))
, 1,0))
AS 'W'
I hope I've given enough information, any direction/advice would be much appreciated!
Thanks!
Use SELECT TOP number | percent at the beginning of the SELECT (that is SQL Server syntax; MySQL uses LIMIT at the end of the query instead).
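The subquery idea from the question (join the audit table to a grouped copy of itself) could look roughly like the sketch below. It is written in Python against an in-memory SQLite database so that it can be run directly. The sample rows, the fixed cutoff and the newest alias are made up for the example. It uses MAX(date_created) rather than MAX(id) because the ids are hash-like, and it assumes that no two audit rows of the same opportunity share a timestamp.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE opportunities_audit ("
    "id TEXT PRIMARY KEY, parent_id TEXT, after_value_string TEXT, date_created TEXT)"
)
conn.executemany(
    "INSERT INTO opportunities_audit VALUES (?, ?, ?, ?)",
    [
        ("f3a1", "opp-1", "Win", "2018-05-01 09:00:00"),
        ("09bc", "opp-1", "Loss", "2018-05-02 09:00:00"),
        ("c77e", "opp-1", "Win", "2018-05-03 09:00:00"),   # latest row for opp-1
        ("1d20", "opp-2", "Loss", "2018-05-02 12:00:00"),  # latest row for opp-2
        ("aa01", "opp-3", "Win", "2018-04-01 08:00:00"),   # older than the cutoff
    ],
)

cutoff = "2018-04-27 00:00:00"  # stands in for date_sub(now(), INTERVAL 1 WEEK)
latest = conn.execute(
    """
    SELECT oa.parent_id, oa.after_value_string
    FROM opportunities_audit oa
    JOIN (SELECT parent_id, MAX(date_created) AS latest_date
          FROM opportunities_audit
          WHERE after_value_string IN ('Loss', 'Win')
            AND date_created > :cutoff
          GROUP BY parent_id) newest
      ON newest.parent_id = oa.parent_id
     AND newest.latest_date = oa.date_created
    WHERE oa.after_value_string IN ('Loss', 'Win')
    ORDER BY oa.parent_id
    """,
    {"cutoff": cutoff},
).fetchall()

# Each opportunity now contributes at most one row, so a Win -> Loss -> Win
# sequence counts as a single win.
wins = sum(1 for _, stage in latest if stage == "Win")
losses = sum(1 for _, stage in latest if stage == "Loss")
print(latest, wins, losses)
```

In the actual report the same derived table could be LEFT JOINed to opportunities in place of the raw audit table, so the existing sum(IF(...)) columns would not need a subquery each.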

How to detect if cell value appears twice in one day

Apologies in advance for the vagueness of the title. This is an issue that is stumping me and I struggled to get any more specific.
First of all, to help visualise my problem I've uploaded a photo of my database to http://imgur.com/a/rTyn8.
Basically, I've been adding up payments in my database and have run into a complex (contextually to my understanding of MySQL, anyway, which is mediocre at best) problem.
I want to calculate the number of times any given customer (customer_id) has a job_id payment of both 17 & 12 in one day. If they do, I then want to calculate the added cost of them. However, I'd like to run this query throughout the whole database between 2 specific dates (eg. 2016-01-01 -- 2016-05-06) and generate the total income during this period.
In the picture I link to above, the customer with a customer_id of 1658 has two payments: one with a job_id of 12 and one with a job_id of 17. Therefore, I would like to add the costs of both (6.00 + 19.80) together, do the same for anyone else who meets this criterion, and come to a total figure.
Just to clarify, the customer (with a customer_id of 1913) below the rows I refer to would also fall under this category.
I've tried my best at getting something together, but admittedly I'm completely lost.
Thanks in advance,
Liam
Join the table to itself, once for each job type:
select
count(*) quantity,
sum(a.cost + b.cost) total
from mytable a
join mytable b on b.customer_id = a.customer_id
and a.date = b.date
where a.date between '2016-01-01' and '2016-05-06'
and a.job_id = 17
and b.job_id = 12
If you want a breakdown by customer_id, add a.customer_id to the selected columns and add group by a.customer_id (the alias is needed because customer_id alone is ambiguous in a self-join).
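Here is a small runnable check of the self-join, using Python with an in-memory SQLite database. The rows are made up to mirror the situation in the question: the dates are hypothetical, and only the costs for customer 1658 come from the post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (customer_id INTEGER, job_id INTEGER, cost REAL, date TEXT)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?)",
    [
        (1658, 12, 6.00, "2016-03-01"),
        (1658, 17, 19.80, "2016-03-01"),  # both jobs on the same day: counted
        (1913, 12, 6.00, "2016-03-02"),
        (1913, 17, 19.80, "2016-03-02"),  # counted as well
        (2000, 17, 19.80, "2016-03-03"),  # job 17 only: not counted
        (2001, 12, 6.00, "2016-03-04"),
        (2001, 17, 19.80, "2016-03-05"),  # both jobs, but on different days
    ],
)

quantity, total = conn.execute(
    """
    SELECT COUNT(*) AS quantity, SUM(a.cost + b.cost) AS total
    FROM mytable a
    JOIN mytable b ON b.customer_id = a.customer_id
                  AND a.date = b.date
    WHERE a.date BETWEEN '2016-01-01' AND '2016-05-06'
      AND a.job_id = 17
      AND b.job_id = 12
    """
).fetchone()
print(quantity, round(total, 2))
```

One thing to keep in mind: if a customer can have several payments with the same job_id on one day, the join produces one row per combination and the total is inflated accordingly.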

Finding the sum of a set of calculated sums

I am developing a php/mysql database.
I have a table called ‘actions’ which (amongst others) contains fields hrs, mins, actiondate, invoiceid and staffid.
For any particular actiondate there could be any number of actions carried out by various staff who would enter their time as hrs and mins.
What I need to do is produce a table which for each date and for a specific member of staff and invoice, adds up all of the hrs and mins for each date as a decimal, rounds it up to the nearest quarter and displays that result. I also need to be able to add up all of those results and display that total.
For example, if on March 1st the person with staffid=23 had carried out 4 actions for invoiceid 121 lasting 1h2m, 23m, 10m and 20m, the total for that day would be 62+23+10+20 = 115m = 115/60 = 1.92 hours, which would be rounded up to 2.00.
I can get each day’s total (maybe not very elegantly) and display it against the date using the code below
SELECT actions.`actiondate`,
(FORMAT((((CEIL((((60*SUM(hrs))+SUM(mins))/60)*4))/4)),2)) AS dayfeeqtr
FROM actions
WHERE staff.staffid='23'
AND invoiceid='121'
GROUP BY actions.`actiondate`
However, what I can’t work out, is how can I add up all of these rounded up results for that invoice and that member of staff.
Can anyone help please?
If I understand correctly, you can use a subquery:
SELECT sum(dayfeeqtr)
FROM (SELECT a.`actiondate`,
             FORMAT((((CEIL((((60*SUM(hrs))+SUM(mins))/60)*4))/4)), 2) AS dayfeeqtr
      FROM actions a
      WHERE a.staffid = '23' AND invoiceid = '121'  -- staffid taken from actions
      GROUP BY a.`actiondate`
     ) a;
I do note that your query is not correct -- for instance, there is a reference to staff, which is not in a from clause. However, you say that this is working, so I assume the errors are a transcription problem.
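A minimal runnable sketch of the same "sum over a per-day subquery" shape, in Python with SQLite. SQLite may not have CEIL available, so rounding up to the quarter hour is done with integer division, (minutes + 14) / 15, which is a stand-in for the CEIL expression and an assumption of this sketch. The year in the dates and the second day's rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE actions ("
    "actiondate TEXT, staffid INTEGER, invoiceid INTEGER, hrs INTEGER, mins INTEGER)"
)
conn.executemany(
    "INSERT INTO actions VALUES (?, ?, ?, ?, ?)",
    [
        ("2018-03-01", 23, 121, 1, 2),   # 62 minutes
        ("2018-03-01", 23, 121, 0, 23),
        ("2018-03-01", 23, 121, 0, 10),
        ("2018-03-01", 23, 121, 0, 20),  # day total: 115 min -> 2.00 h
        ("2018-03-02", 23, 121, 0, 50),  # day total: 50 min -> 1.00 h
        ("2018-03-02", 24, 121, 3, 0),   # other member of staff: ignored
    ],
)

per_day_sql = """
    SELECT actiondate,
           -- whole quarter hours, rounded up, then converted back to hours
           ((60 * SUM(hrs) + SUM(mins) + 14) / 15) / 4.0 AS dayfeeqtr
    FROM actions
    WHERE staffid = 23 AND invoiceid = 121
    GROUP BY actiondate
"""
per_day = conn.execute(per_day_sql + " ORDER BY actiondate").fetchall()
total = conn.execute(f"SELECT SUM(dayfeeqtr) FROM ({per_day_sql}) AS d").fetchone()[0]
print(per_day, total)
```

The inner query here stays numeric and leaves formatting to the display layer. In MySQL, FORMAT returns a string with thousands separators, so summing its output is only safe while each daily value stays below 1,000.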

MySQL Group By Order and Count(Distinct)

What is the best way to think about the Group By function in MySQL?
I am writing a MySQL query to pull data through an ODBC connection in a pivot table in Excel so that users can easily access the data.
For example, I have:
Select
statistic_date,
week(statistic_date,4),
year(statistic_date),
Emp_ID,
count(distinct Emp_ID),
Site,
Cost_Center
I'm trying to count the number of unique employees we have by site by week. The problem I'm running into is around year end: the calendar years don't always match up, so it is important to have them by date so that I can manually filter down to the correct dates using a pivot table (in 2013/2014 there was a week where we had to add week 53 + week 1).
I'm experimenting by using different group by statements but I'm not sure how the order matters and what changes when I switch them around.
i.e.
Group by week(statistic_date,4), Site, Cost_Center, Emp_ID
vs
Group by Site, Cost_Center, week(statistic_date,4), Emp_ID
Other things to note:
-Employees can work any number of days. Some are working 4 x 10's, others 5 x 8's with possibly a 6th day if they sign up for OT. If I sum the counts by week, I get anywhere between 3-7 per Emp_ID. I'm hoping to get 1 for the week.
-There are different pay code per employee so the distinct count helps when we are looking by day (VTO = Voluntary Time Off, OT = Over Time, LOA = Leave of Absence, etc). The distinct count will show me 1, where often times I will have 2-3 for the same emp in the same day (hits 40 hours and starts accruing OT then takes VTO or uses personal time in the same day).
I'm starting with a query I wrote to understand our paid hours by week. I'm trying to adapt it for this application. Actual code is below:
SELECT
dkh.STATISTIC_DATE AS 'Date'
,week(dkh.STATISTIC_DATE,4) as 'Week'
,month(dkh.STATISTIC_DATE) as 'Month'
,year(dkh.STATISTIC_DATE) as 'Year'
,dkh.SITE AS 'Site ID Short'
,aep.LOC_DESCR as 'Site Name'
,dkh.EMPLOYEE_ID AS 'Employee ID'
,count(distinct dkh.EMPLOYEE_ID) AS 'Distinct Employee ID'
,aep.NAME AS 'Employee Name'
,aep.BUSINESS_TITLE AS 'Business_Ttile'
,aep.SPRVSR_NAME AS 'Manager'
,SUBSTR(aep.DEPTID,1,4) AS 'Cost_Center'
,dkh.PAY_CODE
,dkh.PAY_CODE_SHORT
,dkh.HOURS
FROM metrics.DAT_KRONOS_HOURS dkh
JOIN metrics.EMPLOYEES_PUBLIC aep
ON aep.SNAPSHOT_DATE = SUBDATE(dkh.STATISTIC_DATE, DAYOFWEEK(dkh.STATISTIC_DATE) + 1)
AND aep.EMPLID = dkh.EMPLOYEE_ID
WHERE dkh.STATISTIC_DATE BETWEEN adddate(now(), interval -1 year) AND DATE(now())
group by dkh.SITE, SUBSTR(aep.DEPTID,1,4), week(dkh.STATISTIC_DATE,4), dkh.STATISTIC_DATE, dkh.EMPLOYEE_ID
The order you use in group by doesn't matter. Each unique combination of the values gets a group of its own. Selecting columns you don't group by gives you somewhat arbitrary results; you'd probably want to use some aggregation function on them, such as SUM to get the group total.
Grouping by values you derive from other values that you already use in group by, like below, isn't very useful.
week(dkh.STATISTIC_DATE,4), dkh.STATISTIC_DATE
If two rows have different weeks, they'll also have different dates, right?
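To make both points concrete, here is a small sketch in Python with an in-memory SQLite database. The hours table and its rows are hypothetical, and strftime('%Y-%W', ...) stands in for MySQL's year()/week() since SQLite has no WEEK function. It shows that swapping the GROUP BY order gives the same groups, and that including the employee id in the GROUP BY forces the distinct count to 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hours (statistic_date TEXT, site TEXT, emp_id INTEGER)")
conn.executemany(
    "INSERT INTO hours VALUES (?, ?, ?)",
    [
        ("2018-01-01", "A", 1),
        ("2018-01-02", "A", 1),  # same employee, same week
        ("2018-01-03", "A", 2),
        ("2018-01-08", "A", 1),  # following week
        ("2018-01-02", "B", 3),
    ],
)


def grouped(group_by):
    """Count distinct employees per group for the given GROUP BY column list."""
    sql = f"""
        SELECT site,
               strftime('%Y-%W', statistic_date) AS yw,
               COUNT(DISTINCT emp_id) AS n
        FROM hours
        GROUP BY {group_by}
        ORDER BY site, yw
    """
    return conn.execute(sql).fetchall()


by_site_week = grouped("site, yw")
by_week_site = grouped("yw, site")           # same groups, different order
per_employee = grouped("site, yw, emp_id")   # one group per employee

print([n for _, _, n in by_site_week])
```

So to get unique employees per site per week, group by site and the year/week only, and leave the employee id and the raw date out of both the GROUP BY and the selected columns.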