What is the difference between the two SQL queries below? - MySQL

select
substr(insert_date, 1, 14),
device, count(1)
from
abc.xyztable
where
insert_date >= DATE_SUB(NOW(), INTERVAL 10 DAY)
group by
device, substr(insert_date, 1, 14) ;
Then I am trying to get the average of the same row counts that I got above.
SELECT
date, device, AVG(count)
FROM
(SELECT
substr(insert_date, 1, 14) AS date,
device,
COUNT(1) AS count
FROM
abc.xyztable
WHERE
insert_date >= DATE_SUB(NOW(), INTERVAL 10 DAY)
GROUP BY
device, substr(insert_date, 1, 14)) a
GROUP BY
device, date;
I found that both queries return the same results; I tried this for the last 10 days of data.
My goal is to get the average row count over the last 10 days, based on the counts produced by the first query above.

I'm not entirely sure what you're asking. The "difference" between the two queries is that the first one is valid but the second does not appear to be, as per HoneyBadger's comment. They also seem to be trying to achieve two different goals.
However, I think what you are trying to do is produce a query based on the data from the first query, which returns the date, device, and an average of the count column. If so, I believe the following query would calculate this:
WITH dataset AS (
    SELECT substr(insert_date, 1, 14) AS theDate, device, COUNT(*) AS theCount
    FROM abc.xyztable
    WHERE insert_date >= DATE_SUB(NOW(), INTERVAL 10 DAY)
    GROUP BY device, substr(insert_date, 1, 14)
)
SELECT theDate, device,
       (SELECT ROUND(AVG(CAST(theCount AS FLOAT)), 2) FROM dataset) AS Average
FROM dataset
GROUP BY theDate, device;
I have referenced the accepted answers of this question to calculate the average: How to calculate average of a column and then include it in a select query in oracle?
And this question to tidy up the query: Formatting Clear and readable SQL queries
Without having a sample of your data, or any proper context, I can't see how this would be especially useful, so if it was not what you were looking for, please edit your question and clarify exactly what you need.
EDIT: Based on the extra information you have provided, I've made a tweak to my solution to increase the precision of the average column; it now calculates the average to two decimal places. You have stated that this returns the same result as your original query, but the two queries are not computing the same thing. If the count column is consistently around the same number with little variation, the AVG function will round it, which could produce results that look the same, especially when you only compare a small sample, so I have amended my answer to demonstrate this. Again, we would all be able to help you much more easily if you provided more information, such as a sample of your data.

If you want an average, you need to change the last GROUP BY:
to get an average per device
GROUP BY device;
to get an average per date
GROUP BY date;
or remove it completely to get an average for all rows in the sub-query
Update
Below is a full example for getting the average per device
SELECT device, avg(count)
FROM (SELECT substr(insert_date,1,14) as date, device, count(1) as count
FROM abc.xyztable
WHERE insert_date >=DATE_SUB(NOW(), INTERVAL 10 DAY)
GROUP BY device,substr(insert_date,1,14)) a
GROUP BY device;
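For the third option (no outer GROUP BY at all), a minimal sketch of the overall average across every (date, device) group, using the same subquery as above (the avg_count alias is just illustrative):
SELECT AVG(count) AS avg_count
FROM (SELECT substr(insert_date,1,14) as date, device, count(1) as count
      FROM abc.xyztable
      WHERE insert_date >= DATE_SUB(NOW(), INTERVAL 10 DAY)
      GROUP BY device, substr(insert_date,1,14)) a;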

Related

Efficient SQL Query to calculate portion of a row in half hourly time series that has occurred

I have a table that looks like this:
id | slot             | total
---+------------------+------
 1 | 2022-12-01T12:00 |   100
 2 | 2022-12-01T12:30 |   150
 3 | 2022-12-01T13:00 |   200
There's an index on slot already. The table has ~100mil rows (and a bunch more columns not shown here).
I want to sum the total up to the current moment in time (EDIT: WASN'T CLEAR INITIALLY, I WILL PROVIDE A LOWER SLOT BOUND, SO THE SUM WILL BE OVER SOME NUMBER OF DAYS/WEEKS, NOT OVER FULL TABLE). Let's say the time is currently 2022-12-01T12:45. If I run select * from my_table where slot < CURRENT_TIMESTAMP(),
then I get back records 1 and 2.
However, in my data, the records represent forecasted sales within a time slot. I want to find the forecasts as of 2022-12-01T12:45, and so I want to find the proportion of the half hour slot of record 2 that has elapsed, and return that proportion of the total.
As of 2022-12-01T12:45 (assuming minute granularity), 50% of row 2 has elapsed, so I would expect the total to return as 150 / 2 = 75.
My current query works, but is slow. What are some ways I can optimise this, or other approaches I can take?
Also, how can we extend this solution to be generalised to any interval frequency? Maybe tomorrow we change our forecasting model and the data comes in sporadically. The hardcoded 30 would not work in that case.
select sum(fraction * total) as t from
(select total,
        LEAST(
            timestampdiff(minute, slot, current_timestamp()),
            30
        ) / 30 as fraction
 from my_table
 where slot <= current_timestamp()) sub
Consider computing your running sum first, then subtracting the unelapsed portion of the last slot's total. To keep the last row's own total available, I'd apply a window function instead of an aggregation, and limit the output to the last row.
SET @current_time = CURRENT_TIMESTAMP();

WITH cte AS (
    SELECT slot,
           SUM(total) OVER (ORDER BY slot) AS total,
           total AS rowtotal
    FROM my_table
    WHERE slot < @current_time
    ORDER BY slot DESC
    LIMIT 1
)
SELECT slot,
       total - (30 - TIMESTAMPDIFF(MINUTE, slot, @current_time)) / 30 * rowtotal AS total
FROM cte;
Check the demo here.
Note1: Adding an index on the slot field is likely to boost this query's performance.
Note2: If your query runs over millions of rows, the timestamp is likely to change while the query is running. You could store it in a variable before the query runs (or in another CTE).
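For example, a minimal sketch of that second option, with the timestamp captured in its own CTE instead of a variable (the names now_cte and ts are illustrative; assumes MySQL 8.0+ and the same my_table):
WITH now_cte AS (
    -- capture "now" once so every reference sees the same instant
    SELECT CURRENT_TIMESTAMP() AS ts
),
cte AS (
    SELECT slot,
           SUM(total) OVER (ORDER BY slot) AS total,
           total AS rowtotal,
           now_cte.ts AS ts
    FROM my_table
    CROSS JOIN now_cte
    WHERE slot < now_cte.ts
    ORDER BY slot DESC
    LIMIT 1
)
SELECT slot,
       total - (30 - TIMESTAMPDIFF(MINUTE, slot, ts)) / 30 * rowtotal AS total
FROM cte;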
Also consider creating a B-tree index on the slot column, as it has high selectivity.
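A minimal sketch of that statement (the index name is illustrative):
CREATE INDEX idx_my_table_slot ON my_table (slot) USING BTREE;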

SQL: Reuse function result in query without using sub-query

In a MySQL DB table that stores sale orders, I have a LastReviewed column that holds the last date and time when the sale order was modified (type timestamp, default value CURRENT_TIMESTAMP). I'd like to plot the number of sales that were modified each day, for the last 90 days, for a particular user.
I'm trying to craft a SELECT that returns the number of days since LastReviewed date, and how many records fall within that range. Below is my query, which works just fine:
SELECT DATEDIFF(CURDATE(), LastReviewed) AS days, COUNT(*) AS number FROM sales
WHERE UserID=123 AND DATEDIFF(CURDATE(),LastReviewed)<=90
GROUP BY days
ORDER BY days ASC
Notice that I am computing DATEDIFF() as well as CURDATE() multiple times for each record. This seems really inefficient, so I'd like to know how I can reuse the result of the previous computation. The first thing I tried was:
SELECT DATEDIFF(CURDATE(), LastReviewed) AS days, COUNT(*) AS number FROM sales
WHERE UserID=123 AND days<=90
GROUP BY days
ORDER BY days ASC
Error: Unknown column 'days' in 'where clause'. So I started to look around the net. Based on another discussion (Can I reuse a calculated field in a SELECT query?), I next tried the following:
SELECT DATEDIFF(CURDATE(), LastReviewed) AS days, COUNT(*) AS number FROM sales
WHERE UserID=123 AND (SELECT days)<=90
GROUP BY days
ORDER BY days ASC
Error: Unknown column 'days' in 'field list'. I also tried the following:
SELECT @days := DATEDIFF(CURDATE(), LastReviewed) AS days,
COUNT(*) AS number FROM sales
WHERE UserID=123 AND @days <= 90
GROUP BY days
ORDER BY days ASC
The query returns zero results, so @days <= 90 seems to evaluate as false, even though if I put it in the SELECT clause and remove the WHERE clause, I can see some results with @days values below 90.
I've gotten things to work by using a sub-query:
SELECT * FROM (
SELECT DATEDIFF(CURDATE(),LastReviewed) AS days,
COUNT(*) AS number FROM sales
WHERE UserID=123
GROUP BY days
) AS t
WHERE days<=90
ORDER BY days ASC
However, I don't know whether it's the most efficient way. Not to mention that even this solution computes CURDATE() once per record, even though its value will be the same from the start to the end of the query. Isn't that wasteful? Am I overthinking this? Help would be welcome.
Note: Mods, should this be on CodeReview? I posted here because the code I'm trying to use doesn't actually work
There are actually two problems with your question.
First, you're overlooking the fact that WHERE precedes SELECT. When the server evaluates WHERE <expression>, it then already knows the value of the calculations done to evaluate <expression> and can use those for SELECT.
Worse than that, though, you should almost never write a WHERE clause that wraps a column in a function call, since that usually forces the server to evaluate the expression for every row and prevents it from using an index on that column.
Instead, you should use this:
WHERE LastReviewed >= DATE_SUB(CURDATE(), INTERVAL 90 DAY)
The optimizer will see this and get all excited, because DATE_SUB(CURDATE(), INTERVAL 90 DAY) can be resolved to a constant, which can be used on one side of a range comparison. That means that if an index exists with LastReviewed as the leftmost relevant column, the server can immediately eliminate all of the rows with LastReviewed earlier than that constant value, using the index.
Then DATEDIFF(CURDATE(), LastReviewed) AS days (still needed for SELECT) will only be evaluated against the rows we already know we want.
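Putting that together, a sketch of the rewritten query (same table and columns as in the question):
SELECT DATEDIFF(CURDATE(), LastReviewed) AS days, COUNT(*) AS number
FROM sales
WHERE UserID = 123
  AND LastReviewed >= DATE_SUB(CURDATE(), INTERVAL 90 DAY)
GROUP BY days
ORDER BY days ASC;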
Add a single index on (UserID, LastReviewed) and the server will be able to pinpoint exactly the relevant rows extremely quickly.
Builtin functions are much less costly than, say, fetching rows.
You could get a lot more performance improvement with the following 'composite' index:
INDEX(UserID, LastReviewed)
and change to
WHERE UserID=123
AND LastReviewed >= CURRENT_DATE() - INTERVAL 90 DAY
Your formulation 'hides' LastReviewed inside a function call, making the index on it unusable.
If you are still not satisfied with that improvement, then consider a nightly query that computes yesterday's statistics and puts them in a "Summary table". From there, the SELECT you mentioned can run even faster.
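A rough sketch of that approach; the summary table name and its nightly refresh are illustrative, not part of the original schema:
CREATE TABLE sales_daily_summary (
    UserID INT NOT NULL,
    ReviewDate DATE NOT NULL,
    number INT NOT NULL,
    PRIMARY KEY (UserID, ReviewDate)
);

-- nightly job: capture yesterday's per-user counts
INSERT INTO sales_daily_summary (UserID, ReviewDate, number)
SELECT UserID, DATE(LastReviewed), COUNT(*)
FROM sales
WHERE LastReviewed >= CURDATE() - INTERVAL 1 DAY
  AND LastReviewed <  CURDATE()
GROUP BY UserID, DATE(LastReviewed)
ON DUPLICATE KEY UPDATE number = VALUES(number);
The 90-day report then becomes a cheap read against sales_daily_summary instead of a scan of the sales table.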

MySQL cumulative sum grouped by date

I know there have been a few posts related to this, but my case is a little bit different and I wanted to get some help on this.
I need to pull some data out of the database that gives a cumulative count of interactions by day. Currently this is what I have:
SELECT
e.Date AS e_date,
count(e.ID) AS num_interactions
FROM example AS e
JOIN example e1 ON e1.Date <= e.Date
GROUP BY e.Date;
The output of this is close to what I want but not exactly what I need.
The problem I'm having is that the dates are stored with the hour, minute, and second at which the interaction happened, so the GROUP BY is not grouping days together.
This is what the output looks like.
On 12-23 there are 5 interactions, but they are not grouped because the timestamps are different. So I need to find a way to ignore the timestamp and just look at the day.
If I try GROUP BY DAY(e.Date) it groups the data by the day only (i.e. everything that happened on the 1st of any month is grouped into one row), and the output is not what I want at all.
GROUP BY DAY(e.Date), MONTH(e.Date) splits it up by month and day of the month, but again the count is off.
I'm not a MySQL expert at all, so I'm puzzled about what I'm missing.
New Answer
At first, I didn't understand you were trying to do a running total. Here is how that would look:
SET @runningTotal = 0;

SELECT
    e_date,
    num_interactions,
    @runningTotal := @runningTotal + totals.num_interactions AS runningTotal
FROM
    (SELECT
        DATE(e.Date) AS e_date,
        COUNT(*) AS num_interactions
    FROM example AS e
    GROUP BY DATE(e.Date)) totals
ORDER BY e_date;
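If you're on MySQL 8.0 or later, a window function avoids the user variable entirely; a sketch against the same assumed table:
SELECT
    DATE(e.Date) AS e_date,
    COUNT(e.ID) AS num_interactions,
    SUM(COUNT(e.ID)) OVER (ORDER BY DATE(e.Date)) AS runningTotal
FROM example AS e
GROUP BY DATE(e.Date)
ORDER BY e_date;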
Original Answer
You could be getting duplicates because of your join. Maybe e1 has more than one match for some rows which is inflating your count. Either that or the comparison in your join is also comparing the seconds, which is not what you expect.
Anyhow, instead of chopping the datetime field into days and months, just strip the time from it. Here is how you do that.
SELECT
DATE(e.Date) AS e_date,
count(e.ID) AS num_interactions
FROM example AS e
JOIN example e1 ON DATE(e1.Date) <= DATE(e.Date)
GROUP BY DATE(e.Date);
I figured out what I needed to do last night... but since I'm new to this I couldn't post it then... what I did that worked was this:
SELECT
DATE(e.Date) AS e_date,
count(e.ID) AS num_daily_interactions,
(
SELECT
COUNT(id)
FROM example
WHERE DATE(Date) <= e_date
) as total_interactions_per_day
FROM example AS e
GROUP BY e_date;
Would that be less efficient than your query? I may just do the calculation in Python after pulling out the count per day if that's more efficient, because this will be on the scale of thousands to hundreds of thousands of rows returned.

MySQL Group By Order and Count(Distinct)

What is the best way to think about the Group By function in MySQL?
I am writing a MySQL query to pull data through an ODBC connection in a pivot table in Excel so that users can easily access the data.
For example, I have:
Select
statistic_date,
week(statistic_date,4),
year(statistic_date),
Emp_ID,
count(distinct Emp_ID),
Site,
Cost_Center
I'm trying to count the number of unique employees we have by site by week. The problem I'm running into is that around year end the calendar years don't always match up, so it is important to have the data by date so that I can manually filter down to the correct dates using a pivot table (2013/2014 had a week where we had to add week 53 + week 1).
I'm experimenting by using different group by statements but I'm not sure how the order matters and what changes when I switch them around.
i.e.
Group by week(statistic_date,4), Site, Cost_Center, Emp_ID
vs
Group by Site, Cost_Center, week(statistic_date,4), Emp_ID
Other things to note:
-Employees can work any number of days. Some are working 4 x 10's, others 5 x 8's with possibly a 6th day if they sign up for OT. If I sum the counts by week, I get anywhere between 3-7 per Emp_ID. I'm hoping to get 1 for the week.
-There are different pay codes per employee, so the distinct count helps when we are looking by day (VTO = Voluntary Time Off, OT = Over Time, LOA = Leave of Absence, etc.). The distinct count will show me 1, where often I will have 2-3 rows for the same employee in the same day (hits 40 hours and starts accruing OT, then takes VTO or uses personal time in the same day).
I'm starting with a query I wrote to understand our paid hours by week. I'm trying to adapt it for this application. Actual code is below:
SELECT
dkh.STATISTIC_DATE AS 'Date'
,week(dkh.STATISTIC_DATE,4) as 'Week'
,month(dkh.STATISTIC_DATE) as 'Month'
,year(dkh.STATISTIC_DATE) as 'Year'
,dkh.SITE AS 'Site ID Short'
,aep.LOC_DESCR as 'Site Name'
,dkh.EMPLOYEE_ID AS 'Employee ID'
,count(distinct dkh.EMPLOYEE_ID) AS 'Distinct Employee ID'
,aep.NAME AS 'Employee Name'
,aep.BUSINESS_TITLE AS 'Business_Title'
,aep.SPRVSR_NAME AS 'Manager'
,SUBSTR(aep.DEPTID,1,4) AS 'Cost_Center'
,dkh.PAY_CODE
,dkh.PAY_CODE_SHORT
,dkh.HOURS
FROM metrics.DAT_KRONOS_HOURS dkh
JOIN metrics.EMPLOYEES_PUBLIC aep
ON aep.SNAPSHOT_DATE = SUBDATE(dkh.STATISTIC_DATE, DAYOFWEEK(dkh.STATISTIC_DATE) + 1)
AND aep.EMPLID = dkh.EMPLOYEE_ID
WHERE dkh.STATISTIC_DATE BETWEEN adddate(now(), interval -1 year) AND DATE(now())
group by dkh.SITE, SUBSTR(aep.DEPTID,1,4), week(dkh.STATISTIC_DATE,4), dkh.STATISTIC_DATE, dkh.EMPLOYEE_ID
The order you use in group by doesn't matter. Each unique combination of the values gets a group of its own. Selecting columns you don't group by gives you somewhat arbitrary results; you'd probably want to use some aggregation function on them, such as SUM to get the group total.
Grouping by values you derive from other values that you already use in group by, like below, isn't very useful.
week(dkh.STATISTIC_DATE,4), dkh.STATISTIC_DATE
If two rows have different weeks, they'll also have different dates, right?
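So, if the goal is one distinct-employee count per site, cost center, and week, a sketch would group only at that level and drop the per-row columns (same tables as in the question; the join condition is simplified here):
SELECT
    year(dkh.STATISTIC_DATE) AS yr,
    week(dkh.STATISTIC_DATE, 4) AS wk,
    dkh.SITE,
    SUBSTR(aep.DEPTID, 1, 4) AS Cost_Center,
    COUNT(DISTINCT dkh.EMPLOYEE_ID) AS distinct_employees
FROM metrics.DAT_KRONOS_HOURS dkh
JOIN metrics.EMPLOYEES_PUBLIC aep
    ON aep.EMPLID = dkh.EMPLOYEE_ID
WHERE dkh.STATISTIC_DATE BETWEEN adddate(now(), interval -1 year) AND DATE(now())
GROUP BY yr, wk, dkh.SITE, Cost_Center;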

mysql returns junk value when querying for the week value

select (SELECT power FROM newdb.newmeter where date(dt)=curdate() order by dt desc limit 1)
-(select Power from newdb.newmeter where date(dt)=(select date(subdate(now(), interval weekday(now()) day))) limit 0,1) as difference;
The above query is part of my program, which gives me the difference in the data stored between day 1 of the week and the current day of the week. The queries individually work fine, as below, and return:
SELECT power FROM newdb.newmeter where date(dt)=curdate() order by dt desc limit 1;
result: 941690 current time
select Power from newdb.newmeter where date(dt)=(select date(subdate(now(), interval weekday(now()) day))) limit 0,1;
result 93242.4 at the start of the week (or day for today as its monday)
But as soon as I run the difference query, which is just the difference between the two above, the result is: 848447.8515625
This seems really strange; I don't understand what's wrong with it. Please help.
You don't order by dt in your second query, the one looking for power at the start of the week. That means you are selecting an undefined record that happens to have a matching date. For a simple select the table is usually returned in insert order, but that can change when the optimizer thinks it can run the query faster using some other order. Basically, if you don't define an order, you're saying you don't care about the order.
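A minimal sketch of the corrected query, adding the missing ORDER BY to the start-of-week subquery (same table and columns as in the question):
SELECT
    (SELECT power FROM newdb.newmeter
     WHERE date(dt) = curdate()
     ORDER BY dt DESC LIMIT 1)
  - (SELECT power FROM newdb.newmeter
     WHERE date(dt) = date(subdate(now(), interval weekday(now()) day))
     ORDER BY dt ASC LIMIT 1) AS difference;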
How many decimal places do you want in the answer? Use DECIMAL(7,1) for Power (I assume you want one decimal place) instead and see what you get.
I tend to avoid floats/doubles as the approximate values can get you into trouble.
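A minimal sketch of that change (assuming the column can be altered in place; widen the precision if your readings can exceed six integer digits):
ALTER TABLE newdb.newmeter MODIFY Power DECIMAL(7,1);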