I have a table in MySQL which contains posts/entries; each post has a creation date and a category. What I want to do is get the trend of those categories: for each category, what has the trend of posting been over the past hour?
Since you tagged your question with data-warehouse, you should probably have two dimensions: a date dimension for the day and a time dimension for the hour/minute/second component of a date. If you have those two pieces, you can simply run a query joining your time dimension to your main fact table, grouping by hour.
select pc.category, t.hour, count(*)
from posts p
join post_details pc on p.post_details_id = pc.post_details_id -- since you said they were categorized
join time_of_day t on p.time_of_day_id = t.time_of_day_id
group by pc.category, t.hour;
Even if you don't have everything dimensionalized, you should still be able to extract the hour from the date the entry was posted and do a group by.
select p.category, extract(hour from p.post_date), count(*)
from posts p
group by p.category, extract(hour from p.post_date);
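If you only care about recent activity, here is a minimal sketch of the same idea restricted to the last 24 hours (assuming posts has a post_date and a category column, as above):
select p.category, hour(p.post_date) as hr, count(*) as posts
from posts p
where p.post_date >= now() - interval 24 hour
group by p.category, hour(p.post_date)
order by p.category, hr;
This gives one row per category and hour, which you can chart or compare to see how posting in each category is trending.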
I'm pretty new to SQL and I'm struggling with one of the questions in my exercise. How would I calculate average session length per daily active user? The table shown is just a sample of the extended table; imagine loads more rows.
I simply used this query to calculate the daily active users:
SELECT COUNT(DISTINCT user_id)
FROM table1
and welcome to StackOverflow!
now, your question:
How would I calculate average session length per daily active user?
you already have the session length, and using the AVG function you will get a simple average over all rows:
select AVG(session_length_seconds) avg from table_1
but you want it per day... so you need to think "group by day". So how do you get the day? You have activity_date as a DATE column, and it's easy to extract the day, month and year from it, for example
select
DAY(activity_date) day,
MONTH(activity_date) month,
YEAR(activity_date) year
from
table_1
will break the date field down into columns you can use...
now, back to your question: it says daily active user, but all you have is sessions, and a user could have multiple sessions. From the context you have shared I can't tell how you want to handle that, and averaging each individual session makes no sense as data to retrieve. Just to get you started, I'll assume you simply want the average per day
knowing how to get the average, let's create a query that puts it all together:
select
DAY(activity_date) day,
MONTH(activity_date) month,
YEAR(activity_date) year,
AVG(session_length_seconds) avg
from
table_1
group by
DAY(activity_date),
MONTH(activity_date),
YEAR(activity_date)
will output the average of session_length_seconds per day/month/year
about the group by part: you need to list there every field from the select that does not do any calculation (sum, count, etc.). In our case AVG does the calculation, so we don't group by that value, but we do group by the other three values, so we get three columns with day, month and year. You can also use CONCAT to join day, month and year into just one string if you prefer...
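For example, a minimal sketch of that CONCAT variant, keeping the same table_1 and activity_date names used above:
select
CONCAT(YEAR(activity_date), '-', MONTH(activity_date), '-', DAY(activity_date)) day,
AVG(session_length_seconds) avg
from
table_1
group by
CONCAT(YEAR(activity_date), '-', MONTH(activity_date), '-', DAY(activity_date))
The concatenated string acts as a single "day" key, so the result is one row per calendar day with its average session length.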
data table looks like this
Use a query to calculate average income per hour by day of week.
SELECT WEEKDAY(date_start_time), SUM(total_income)/SUM(DATEDIFF(hour,
date_start_time, date_end_time)) AS avg_income
FROM Deliveries
GROUP BY WEEKDAY(date_start_time)
Things to know:
Entry_id is a unique key for each time the employee comes into the office
There will be many records of the same user_id if an employee comes into the office repeatedly
Tasks completed will most likely stay unused in this question
Am I appropriately answering this question?
Things I am concerned about:
1) Does DATEDIFF only return an integer value? If that's the case, then to get a better estimate of avg_income should we use DATEDIFF(minutes, ..., ...) and then calculate the hours with decimal places from that integer?
2) Are people working overnight shifts something that I need to worry about? How much more complicated would it make this query?
3) Moving onward if I was asked to "calculate the average earnings per hour during 9am to 5pm" does this mean I need to calculate this for each individual employee... or for each individual hour (ie. ultimately am I grouping by hour or by user_ID)?
1) Use timediff() — in MySQL, DATEDIFF() only returns a whole number of days and does not take a unit argument the way it does in SQL Server (see the sketch after the examples below).
2) You will not only need to consider overnight shifts, you will also need to consider overtime pay if someone works > 40 hours between the week start date and the week end date for a given week. This only matters if employees are paid different hourly rates for those hours (e.g. time and a half). If it is a factor, roll up your sleeves, because it becomes a full algorithm.
3) This depends on what you are trying to find the average by (user, day, etc.) but a simple way would be to just nest your select and grab an avg().
select avg(earnings) overall_average from
(select user, [calculated_earnings] as earnings from [table] where [conditions]) t
select avg(earnings) overall_average from
(select weekday, [calculated_earnings] as earnings from [table] where [conditions]) t
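For point 1, a minimal sketch of the timediff() approach, reusing the asker's Deliveries table and column names (it assumes every delivery has a non-zero duration):
SELECT WEEKDAY(date_start_time) AS weekday,
       SUM(total_income) / SUM(TIME_TO_SEC(TIMEDIFF(date_end_time, date_start_time)) / 3600) AS avg_income_per_hour
FROM Deliveries
GROUP BY WEEKDAY(date_start_time);
TIME_TO_SEC(TIMEDIFF(...)) gives each shift's length in seconds, so dividing by 3600 yields fractional hours instead of the truncated integers the asker was worried about.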
For a task in a managerial accounting context, I wrote a relatively large SQL query against a MySQL database. The query is close to 600 lines and produces a large table with the economic analysis for different products.
This works fine so far and the query just takes about 3 seconds.
But the outcome is only the analysis for one month. Now we would like to execute the query for a couple of months and aggregate the results.
I could simply change the query's condition to cover a larger time period (it currently covers just one month). But that would lead to an incorrect (averaged) distribution of overhead costs, because it would ignore larger monthly fluctuations in certain key figures.
Therefore, I think, I would have to generate one (sub-)table per month I would like to analyze. Finally, all these sub-tables would have to be aggregated with a superordinate main query. That should probably work, but this query would then be really large. E.g. for 12 months I would need about 12 x 600 lines for the sub-queries and about another 100 lines for the main query.
This leads to my question: is this how one would normally do it? Without knowing better, it seems to me an unusually large query which might also be cumbersome to maintain. What would be the best-practice way to accomplish this task?
Thank you
If the data is static once the month is over, you can launch your SELECT at the beginning of each month (to calculate the previous month) and store the result in a table with an extra "month" column.
insert into monthly_aggregation (month, ...)
select ... <600 lines of SQL for specific month>
This can be triggered at the beginning of every month.
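One way to schedule that inside MySQL itself is the event scheduler. A sketch, assuming it is enabled (SET GLOBAL event_scheduler = ON); the event name and start time are made up, and the column placeholders are the same ones as above:
CREATE EVENT monthly_aggregation_rollup
ON SCHEDULE EVERY 1 MONTH
STARTS '2024-02-01 03:00:00'   -- first day of a month, arbitrary time
DO
  INSERT INTO monthly_aggregation (month, ...)
  SELECT ... ;                 -- the 600-line SELECT, restricted to the previous month
Alternatively, the same INSERT ... SELECT can be run from an external cron job if you prefer to keep scheduling out of the database.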
If historical data can change, you have to rebuild the whole table by executing the INSERT ... SELECT once per month.
Let's say this is your query showing products for a particular month:
select product_id, sum(purchased), avg(price), ...
from <many tables>
where month = 6
group by product_id;
Then you can change it thus to have it show product data per month:
select month, product_id, sum(purchased), avg(price), ...
from <many tables>
group by month, product_id;
You can then work on this with an outer query:
select ...
from
(
select month, product_id, sum(purchased), avg(price), ...
from <many tables>
group by product_id, month
) product_and_month
group by ...;
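For example, to roll twelve of those monthly rows up to a yearly figure per product, a sketch (the outer column names like purchased_total are made up; <many tables> stands for the original 600-line FROM clause):
select product_id,
       sum(purchased_total) as purchased_year,
       avg(avg_price)       as avg_price_year   -- average of the monthly averages
from
(
  select month, product_id,
         sum(purchased) as purchased_total,
         avg(price)     as avg_price
  from <many tables>
  where month between 1 and 12
  group by product_id, month
) product_and_month
group by product_id;
Because each inner row already reflects one month's own figures, the monthly fluctuations in overhead distribution are preserved before the final aggregation.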
What is the best way to think about the Group By function in MySQL?
I am writing a MySQL query to pull data through an ODBC connection in a pivot table in Excel so that users can easily access the data.
For example, I have:
Select
statistic_date,
week(statistic_date,4),
year(statistic_date),
Emp_ID,
count(distinct Emp_ID),
Site,
Cost_Center
I'm trying to count the number of unique employees we have by site by week. The problem I'm running into is that around year end the calendar years don't always match up, so it is important to have the results by date so that I can manually filter down to the correct dates using a pivot table (2013/2014 had a week where we had to add week 53 + week 1).
I'm experimenting by using different group by statements but I'm not sure how the order matters and what changes when I switch them around.
i.e.
Group by week(statistic_date,4), Site, Cost_Center, Emp_ID
vs
Group by Site, Cost_Center, week(statistic_date,4), Emp_ID
Other things to note:
-Employees can work any number of days. Some are working 4 x 10's, others 5 x 8's with possibly a 6th day if they sign up for OT. If I sum the counts by week, I get anywhere between 3-7 per Emp_ID. I'm hoping to get 1 for the week.
-There are different pay codes per employee, so the distinct count helps when we are looking by day (VTO = Voluntary Time Off, OT = Over Time, LOA = Leave of Absence, etc). The distinct count will show me 1, where I would often otherwise have 2-3 rows for the same emp in the same day (hits 40 hours and starts accruing OT, then takes VTO or uses personal time in the same day).
I'm starting with a query I wrote to understand our paid hours by week. I'm trying to adapt it for this application. Actual code is below:
SELECT
dkh.STATISTIC_DATE AS 'Date'
,week(dkh.STATISTIC_DATE,4) as 'Week'
,month(dkh.STATISTIC_DATE) as 'Month'
,year(dkh.STATISTIC_DATE) as 'Year'
,dkh.SITE AS 'Site ID Short'
,aep.LOC_DESCR as 'Site Name'
,dkh.EMPLOYEE_ID AS 'Employee ID'
,count(distinct dkh.EMPLOYEE_ID) AS 'Distinct Employee ID'
,aep.NAME AS 'Employee Name'
,aep.BUSINESS_TITLE AS 'Business_Title'
,aep.SPRVSR_NAME AS 'Manager'
,SUBSTR(aep.DEPTID,1,4) AS 'Cost_Center'
,dkh.PAY_CODE
,dkh.PAY_CODE_SHORT
,dkh.HOURS
FROM metrics.DAT_KRONOS_HOURS dkh
JOIN metrics.EMPLOYEES_PUBLIC aep
ON aep.SNAPSHOT_DATE = SUBDATE(dkh.STATISTIC_DATE, DAYOFWEEK(dkh.STATISTIC_DATE) + 1)
AND aep.EMPLID = dkh.EMPLOYEE_ID
WHERE dkh.STATISTIC_DATE BETWEEN adddate(now(), interval -1 year) AND DATE(now())
group by dkh.SITE, SUBSTR(aep.DEPTID,1,4), week(dkh.STATISTIC_DATE,4), dkh.STATISTIC_DATE, dkh.EMPLOYEE_ID
The order you use in group by doesn't matter. Each unique combination of the values gets a group of its own. Selecting columns you don't group by gives you somewhat arbitrary results; you'd probably want to use some aggregation function on them, such as SUM to get the group total.
Grouping by values you derive from other values that you already use in group by, like below, isn't very useful.
week(dkh.STATISTIC_DATE,4), dkh.STATISTIC_DATE
If two rows have different weeks, they'll also have different dates, right?
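For instance, to get one row per site, cost center and week with a distinct-employee count, here is a sketch that reuses the asker's tables and join but drops the non-aggregated per-row fields:
SELECT year(dkh.STATISTIC_DATE) AS 'Year'
      ,week(dkh.STATISTIC_DATE,4) AS 'Week'
      ,dkh.SITE AS 'Site ID Short'
      ,SUBSTR(aep.DEPTID,1,4) AS 'Cost_Center'
      ,COUNT(DISTINCT dkh.EMPLOYEE_ID) AS 'Distinct Employees'
FROM metrics.DAT_KRONOS_HOURS dkh
JOIN metrics.EMPLOYEES_PUBLIC aep
ON aep.SNAPSHOT_DATE = SUBDATE(dkh.STATISTIC_DATE, DAYOFWEEK(dkh.STATISTIC_DATE) + 1)
AND aep.EMPLID = dkh.EMPLOYEE_ID
WHERE dkh.STATISTIC_DATE BETWEEN adddate(now(), interval -1 year) AND DATE(now())
group by year(dkh.STATISTIC_DATE), week(dkh.STATISTIC_DATE,4), dkh.SITE, SUBSTR(aep.DEPTID,1,4)
Because STATISTIC_DATE and EMPLOYEE_ID are no longer selected or grouped on, each employee is counted once per site/cost center/week, which is the 1-per-week figure you were after.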
I have a SQL question. First of all I'd like to know whether it is even possible with just SQL, and if not, does anyone know a good workaround?
We are building a site, where users can vote for videos.
The users can vote by SMS or directly on site after Facebook authentication.
We have to make a top list of all videos, and calculate the "position" on the list for each video.
So far, we have done that with a simple subquery, something like this:
SELECT v.video_id AS id,
(SELECT (COUNT(*)+1) FROM videos AS v2
WHERE (v2.SMS_votes + v2.facebook_votes) > (v.SMS_votes + v.facebook_votes)) AS total_position
FROM videos AS v
SMS_votes and facebook_votes are aggregated fields. There are separate tables for each kind of vote, with a record for each vote, including the time the vote was cast.
This works fine, the positions are calculated... if 2 or more videos have the same number of votes, they "share" the position.
Unfortunately there can be no position sharing, and we have to resolve it by the following rules:
if 2 videos have the same number of votes, the one with more SMS votes has the advantage
if they also have the same number of SMS votes, the one which has more SMS votes in the last hour has the advantage
if they also have the same number of SMS votes in the last hour, they are compared by the hour before, and recursively like that, until there is a difference between the two
Is it possible to do this kind of recursive ordering only in SQL, or do we have to resolve this manually in code? All ideas are welcomed. Just to note, performance is important here, because the top list is used all over the site.
I don't think it's feasible to perform this kind of ordering with a recursive calculation (which is potentially unbounded), but if you're willing to limit how far back you look, there are ways it could be done.
Here's one possibility.
SELECT video_id,
SMS_votes + facebook_votes AS total_votes,
SMS_votes,
COUNT(CASE WHEN time > NOW() - INTERVAL 1 HOUR THEN 1 END) AS h1,
COUNT(CASE WHEN time > NOW() - INTERVAL 2 HOUR THEN 1 END) AS h2,
COUNT(CASE WHEN time > NOW() - INTERVAL 3 HOUR THEN 1 END) AS h3
FROM videos
JOIN SMS_votes USING(video_id)
GROUP BY video_id
ORDER BY total_votes DESC, SMS_votes DESC, h1 DESC, h2 DESC, h3 DESC;
This assumes you have a table called SMS_votes tracking each vote, with a video_id field and a time field.
For each video, it calculates the total votes, the SMS votes, the SMS votes in the past hour, the past two hours, and the past three hours. It then does an ORDER BY on all those values to get the correct position.
It's fairly easy to extend this to include a wider range of hours, but you might also want to consider using an increasing time range as you go back in time. For example, you first look at votes in the past hour, then the past day, then the past week, etc. I suspect that would lower your chance of videos having the same votes without having to add as many extra calculations.
SQL Fiddle example
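To widen the ranges as suggested above, the same pattern can be stretched to hour/day/week buckets; a sketch, with the same assumed SMS_votes table:
SELECT video_id,
       SMS_votes + facebook_votes AS total_votes,
       SMS_votes,
       COUNT(CASE WHEN time > NOW() - INTERVAL 1 HOUR THEN 1 END) AS last_hour,
       COUNT(CASE WHEN time > NOW() - INTERVAL 1 DAY THEN 1 END) AS last_day,
       COUNT(CASE WHEN time > NOW() - INTERVAL 1 WEEK THEN 1 END) AS last_week
FROM videos
JOIN SMS_votes USING(video_id)
GROUP BY video_id
ORDER BY total_votes DESC, SMS_votes DESC, last_hour DESC, last_day DESC, last_week DESC;
Three buckets covering an hour, a day and a week discriminate between ties far more often than three consecutive hourly buckets, at the same query cost.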