Aggregating statistics into JSON in Postgresql - json

So I am trying to calculate overview statistics into JSON, but am having trouble wrangling them into a query.
There are 2 tables:
appointments
- time timestamp
- patients int
assignments
- user_id int
- appointment_id int
I want to calculate the number of patients by user, by hour for the day. Ideally, it would look like this:
[
{hour: "2015-07-01T08:00:00.000Z", assignments: [
{user_id: 123, patients: 3},
{user_id: 456, patients: 10},
{user_id: 789, patients: 4},
]},
{hour: "2015-07-01T09:00:00.000Z", assignments: [
{user_id: 456, patients: 1},
{user_id: 789, patients: 6}
]},
{hour: "2015-07-01T10:00:00.000Z", assignments: []}
...
]
I got kind of close:
with assignment_totals as (
select user_id,sum(patients) as patients,date_trunc('hour',appointments.time) as hour
from assignments
inner join appointments on appointments.id = assignments.appointment_id
group by date_trunc('hour',appointments.time),user_id
), hours as (
select to_char(date_trunc('hour',time),'YYYY-MM-DD"T"HH24:00:00.000Z') as hour, array_to_json(array_agg(DISTINCT assignment_totals)) as patients
from appointments
left join assignment_totals on date_trunc('hour',appointments.time) = assignment_totals.hour
where time >= '2015-07-01T07:00:00.000Z' and time < '2015-07-02T07:00:00.000Z'
group by date_trunc('hour',time)
order by date_trunc('hour',time)
)
select array_to_json(array_agg(hours)) as hours from hours;
Which outputs:
[
{hour: "2015-07-01T08:00:00.000Z", assignments: [
{user_id: 123, patients: 3, hour: "2015-07-01T08:00:00.000Z" },
{user_id: 456, patients: 10, hour: "2015-07-01T08:00:00.000Z"},
{user_id: 789, patients: 4, hour: "2015-07-01T08:00:00.000Z"},
]},
{hour: "2015-07-01T09:00:00.000Z", assignments: [
{user_id: 456, patients: 1, hour: "2015-07-01T09:00:00.000Z"},
{user_id: 789, patients: 6, hour: "2015-07-01T09:00:00.000Z"}
]},
{hour: "2015-07-01T10:00:00.000Z", assignments: [null]}
...
]
While this works, there are 2 issues, which may or may not be independent of each other:
If there are no appointments that hour, I still want the hour to be included in the array (like 10AM in the example), but to have an empty "assignments" array. Right now it puts a null in there, and I can't figure out how to get rid of it while still keeping the hours in there.
I have to include the hour in each assignments entry, along with user_id and patients, because I need it to join the assignment_totals query to the hours query. But it's unnecessary, because it's already in the parent.
I feel like it should be able to be done in 1 cte and 1 query and now I'm using 2 cte's... but can't figure out how to condense it and make it work.
I wanted to do something like
hours as (
select to_char(date_trunc('hour',time),'YYYY-MM-DD"T"HH24:00:00.000Z') as hour, sum(appointments.patients) OVER(partition by assignments.user_id) as appointments
from appointments
left join assignments on appointments.id = assignments.appointment_id
where time >= '2015-07-01T07:00:00.000Z' and time < '2015-07-02T07:00:00.000Z'
group by date_trunc('hour',time)
order by date_trunc('hour',time)
)
select array_to_json(array_agg(hours)) as hours from hours
but I can't get it to work without it giving me an "attribute must be in the group by or aggregate function" error.
Anyone know how to fix any of these issues? Thanks in advance!

The main issue with your last query seems to be in conflating window functions with aggregate functions. Window functions use the OVER syntax, and they do not in themselves require GROUP BY when there are other fields in the SELECT clause. Aggregate functions, on the other hand, use GROUP BY when there are other (non-aggregate-function) fields in the SELECT clause. One practical consequence of this difference is that window functions are not automatically DISTINCT.
The issue with NULL values resulting from the window function can be resolved with a simple COALESCE such that zero is used instead of null.
So, to write your query using a window function, use something like:
WITH hours AS
(
-- HH24 gives the 24-hour clock; plain HH is the 12-hour clock in to_char
SELECT DISTINCT to_char(date_trunc('hour', ap.time), 'YYYY-MM-DD"T"HH24:00:00.000Z') AS hour,
COALESCE(SUM(ap.patients) OVER (PARTITION BY asgn.user_id), 0) AS appointment_count
FROM appointments ap
LEFT JOIN assignments asgn ON ap.id = asgn.appointment_id
WHERE ap.time >= '2015-07-01T07:00:00.000Z'
AND ap.time < '2015-07-02T07:00:00.000Z'
)
-- the ordering goes inside the aggregate; an outer ORDER BY hour is rejected here
SELECT array_to_json(array_agg(hours ORDER BY hours.hour)) AS hours
FROM hours
With an aggregate function:
WITH hours AS
(
-- HH24 gives the 24-hour clock; plain HH is the 12-hour clock in to_char
SELECT to_char(date_trunc('hour', ap.time), 'YYYY-MM-DD"T"HH24:00:00.000Z') AS hour,
SUM(COALESCE(ap.patients, 0)) AS appointment_count,
asgn.user_id
FROM appointments ap
LEFT JOIN assignments asgn ON ap.id = asgn.appointment_id
WHERE ap.time >= '2015-07-01T07:00:00.000Z'
AND ap.time < '2015-07-02T07:00:00.000Z'
GROUP BY asgn.user_id, to_char(date_trunc('hour', ap.time), 'YYYY-MM-DD"T"HH24:00:00.000Z')
)
-- the ordering goes inside the aggregate; an outer ORDER BY hour is rejected here
SELECT array_to_json(array_agg(hours ORDER BY hours.hour)) AS hours
FROM hours
My syntax may not be quite correct, so double-check before using this solution or one like it (and feel free to edit to correct any errors).
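To make the window-versus-aggregate distinction concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module rather than Postgres (an assumption made only so the example is self-contained; window functions need SQLite 3.25 or newer), and the table is a simplified, hypothetical stand-in for appointments with the hour stored as text.

```python
import sqlite3


def window_vs_group_by():
    """Return (windowed_rows, grouped_rows) for the same SUM over the same data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE appointments (id INTEGER PRIMARY KEY, hour TEXT, patients INTEGER);
        INSERT INTO appointments VALUES
            (1, '08:00', 3), (2, '08:00', 10), (3, '09:00', 1);
    """)
    # Window function: no GROUP BY, every input row is kept and
    # each row carries the total of its partition.
    windowed = conn.execute("""
        SELECT hour, SUM(patients) OVER (PARTITION BY hour) AS total
        FROM appointments
        ORDER BY id
    """).fetchall()
    # Aggregate function: GROUP BY collapses each group into one row.
    grouped = conn.execute("""
        SELECT hour, SUM(patients) AS total
        FROM appointments
        GROUP BY hour
        ORDER BY hour
    """).fetchall()
    conn.close()
    return windowed, grouped
```

The windowed result repeats the 08:00 total once per source row, which is why the window-function version needs DISTINCT to get one row per group.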

Most of my frustration with this came because I was not looking at the Postgres 9.4 documentation, which has new functions for dealing with json.
The solution I found builds upon the original query, but then breaks the assignments array down using json_array_elements, filters using where, then builds it back up again. It seems pointless to have essentially:
json_agg(json_array_elements(json_agg(*)))
But it makes very little performance difference and gets me where I need to go. Feel free to comment if you find a better solution! It should also be possible in <9.4 using array_agg and unnest but I was having trouble because I was trying to unnest a record type returned from my CTE, instead of an actual row type with column definitions.
with assignment_totals as (
select
date_trunc('hour',appointments.time) as hour,
user_id,
coalesce(sum(patients),0) as patients
from appointments
left outer join assignments on appointments.id = assignments.appointment_id
where time >= '2015-07-01T07:00:00.000Z' and time < '2015-07-02T07:00:00.000Z'
group by date_trunc('hour',appointments.time),user_id
), hours as (
select
to_char(assignment_totals.hour,'YYYY-MM-DD"T"HH24:00:00.000Z') as hour,
(
select coalesce(json_agg(json_build_object('user_id',(t->'user_id'),'patients',(t->'patients')) order by (t->>'user_id')),'[]'::json)
from json_array_elements(json_agg(assignment_totals)) t
where (t->>'patients') != '0'
) as patients
from assignment_totals
group by assignment_totals.hour
order by assignment_totals.hour
)
select array_to_json(array_agg(hours)) as hours from hours
Thanks to Andrew for pointing out that I can coalesce nulls to 0. But I still want to filter out entries where patients = 0. This solves all my problems by giving me the ability to filter them out with a where, and then gives me the ability to take out the time by building a new json object with json_build_object.
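For comparison, the same filter-and-rebuild step can be done on the client instead of in SQL. The sketch below is not the query above; it is a hypothetical Python helper (hourly_overview and its parameters are made-up names) that assumes the per-hour, per-user totals have already been fetched as (hour, user_id, patients) tuples.

```python
from datetime import datetime, timedelta


def hourly_overview(totals, start, hours):
    """Build [{hour, assignments: [...]}, ...] for `hours` consecutive hours from `start`.

    totals: iterable of (hour as datetime, user_id or None, patients) tuples.
    Rows with no user or zero patients are dropped, so an hour without
    assignments ends up with an empty list rather than [null].
    """
    by_hour = {}
    for hour, user_id, patients in totals:
        if user_id is None or not patients:
            continue
        # The hour is the dictionary key, so it is not repeated inside each entry.
        by_hour.setdefault(hour, []).append({"user_id": user_id, "patients": patients})

    result = []
    for offset in range(hours):
        hour = start + timedelta(hours=offset)
        entries = sorted(by_hour.get(hour, []), key=lambda entry: entry["user_id"])
        result.append({
            "hour": hour.strftime("%Y-%m-%dT%H:00:00.000Z"),
            "assignments": entries,
        })
    return result
```

Passing the result to json.dumps would give the shape asked for in the question, including hours that have no rows at all in the input.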

Related

Select with Inner Join operator and IN (long results) - Optimize Query

I do a MySql query that looks for results in two different tables.
Tables
Contract
id, contract, creditor_id, client_id, event_id
Invoice
id, contract_id, invoice, due, value
The idea is to select the contracts using some parameters in the query, such as:
initial delay and final, initial value and final, events, creditor.
For this, I use the INNER JOIN, HAVING and IN.
Details:
After receiving the result, I take the values and loop to make an update on each query result, using the result ID.
I built an example in SQL Fiddle for better visualization.
The problem is, when I do this query with very long results or thousands of lines, the query is really slow.
So, I wanted to know if there is a better way to do the same query in an optimal way.
Query:
SELECT `c`.`id`,
`c`.`contract`,
`c`.`creditor_id`,
`c`.`client_id`,
`c`.`event_id`,
`t`.`total_value`,
`delay`
FROM `contract` `c`
INNER JOIN
(SELECT contract_id,
Sum(value) total_value,
Datediff(Curdate(), due) AS delay
FROM invoice t GROUP BY contract_id
HAVING delay <= 99999
AND delay >= 1
AND total_value >= 1
AND total_value < 99999) t ON `t`.`contract_id` = `c`.`id`
WHERE `c`.`creditor_id` = 1
AND `c`.`event_id` IN(4, 7, 5, 8, 13, 3, 6, 15, 2, 24, 1, 21, 20, 14, 17, 18, 16, 23, 25, 22, 9, 10, 26, 12, 19, 11)
If "1..99999" means "any value", then remove the test from the query. That is, construct a different query when the user wants an open-ended test.
Deal with the lack of due in the GROUP BY.
Change Datediff(Curdate(), due) > 123 to due < CURDATE() - INTERVAL 123 DAY. That will give us a chance to use due in an INDEX.
Qualify due and value; we can't tell which table they are in.
Please provide SHOW CREATE TABLE.
c could use INDEX(creditor_id, event_id), but after the above issues are addressed, there may be an even better index.
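The DATEDIFF rewrite suggested above keeps the same rows; it only moves the arithmetic off the column so that due is compared directly against a constant. A small sketch of that equivalence in plain Python (the function names are hypothetical, and dates stand in for the DATE column):

```python
from datetime import date, timedelta


def overdue_by_datediff(due, today, days):
    """Mirrors DATEDIFF(CURDATE(), due) > days: arithmetic is applied to the column value."""
    return (today - due).days > days


def overdue_by_cutoff(due, today, days):
    """Mirrors due < CURDATE() - INTERVAL days DAY: the bare column is compared to a constant."""
    cutoff = today - timedelta(days=days)  # computed once, independent of the row
    return due < cutoff
```

Both predicates agree for every date; the second form is the one an index on due can serve, because the cutoff does not depend on the row.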

Generating complex sql tables

I currently have an employee logging sql table that has 3 columns
fromState: String,
toState: String,
timestamp: DateTime
fromState is either In or Out. In means employee came in and Out means employee went out. Each row can only transition from In to Out or Out to In.
I'd like to generate a temporary table in sql to keep track during a given hour (hour by hour), how many employees are there in the company. Aka, resulting table has columns HourBucket, NumEmployees.
In non-SQL code I can do this by initializing the numEmployees as 0 and go through the table row by row (sorted by timestamp) and add (employee came in) or subtract (went out) to numEmployees (bucketed by timestamp hour).
I'm clueless as how to do this in SQL. Any clues?
Use a COUNT ... GROUP BY query. I can't see from your description what you're using toState for, though! I'm also assuming you have an employeeID field.
E.g.
SELECT fromState AS 'Status', COUNT(*) AS 'Number'
FROM StaffinBuildingTable
INNER JOIN (SELECT employeeID AS 'empID', MAX(timestamp) AS 'latest' FROM StaffinBuildingTable GROUP BY employeeID) AS LastEntry ON StaffinBuildingTable.employeeID = LastEntry.empID
GROUP BY fromState
The LastEntry subquery will produce a list of employeeIDs limited to the last timestamp for each employee.
The INNER JOIN will limit the main table to just the employeeIDs that match both sides.
The outer GROUP BY produces the count.
SELECT HOUR(SBT.timestamp) AS 'Hour', SBT.fromState AS 'Status', COUNT(*) AS 'Number'
FROM StaffinBuildingTable AS SBT
INNER JOIN (
SELECT SBIJ.employeeID AS 'empID', MAX(timestamp) AS 'latest'
FROM StaffinBuildingTable AS SBIJ
WHERE DATE(SBIJ.timestamp) = CURDATE()
GROUP BY SBIJ.employeeID) AS LastEntry ON SBT.employeeID = LastEntry.empID
GROUP BY SBT.fromState, HOUR(SBT.timestamp)
Replace CURDATE() with whatever date you are interested in.
Note this is non-optimal as it calculates the HOUR twice - once for the data and once for the group.
Again, you are using the INNER JOIN to limit the number of returned rows, this time to the last timestamp on a given day.
To me, your description of FromState and ToState seems the wrong way round; I'd expect to be doing this based on ToState. But assuming I'm wrong about that, the following should point you in the right direction:
First, I create a "Numbers" table containing 24 rows one for each hour of the day:
create table tblHours
(Number int);
insert into tblHours values
(0),(1),(2),(3),(4),(5),(6),(7),
(8),(9),(10),(11),(12),(13),(14),(15),
(16),(17),(18),(19),(20),(21),(22),(23);
Then for each date in your employee logging table, I create a row in another new table to contain your counts:
create table tblDailyHours
(
HourBucket datetime,
NumEmployees int
);
insert into tblDailyHours (HourBucket, NumEmployees)
select distinct
date_add(date(t.timeStamp), interval h.Number HOUR) as HourBucket,
0 as NumEmployees
from
tblEmployeeLogging t
CROSS JOIN tblHours h;
Then I update this table to contain all the relevant counts:
update tblDailyHours h
join
(select
h2.HourBucket,
sum(case when el.fromState = 'In' then 1 else -1 end) as cnt
from
tblDailyHours h2
join tblEmployeeLogging el on
h2.HourBucket >= el.timeStamp
group by h2.HourBucket
) cnt ON
h.HourBucket = cnt.HourBucket
set NumEmployees = cnt.cnt;
You can now retrieve the counts with
select *
from tblDailyHours
order by HourBucket;
The counts give the number on site at each of the times displayed, if you want during the hour in question, we'd need to tweak this a little.
There is a working version of this code (using not very realistic data in the logging table) here: rextester.com/DYOR23344
Original Answer (Based on a single over all count)
If you're happy to search over all rows, and want the current "head count" you can use this:
select
sum(case when t.FromState = 'In' then 1 else -1 end) as Heads
from
MyTable t
But if you know that there will always be no-one there at midnight, you can add a where clause to prevent it looking at more rows than it needs to:
where
date(t.timestamp) = curdate()
Again, on the assumption that the head count reaches zero at midnight, you can generalise that method to get a headcount at any time as follows:
where
date(t.timestamp) = "CENSUS DATE" AND
t.timestamp <= "CENSUS DATETIME"
Obviously you'd need to replace my quoted strings with code which returned the date and datetime of interest. If the headcount doesn't return to zero at midnight, you can achieve the same by removing the first line of the where clause.
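The cumulative count used in this answer (add one for an 'In' row, subtract one otherwise, over all rows up to a point in time) can be sketched outside SQL as a single pass over the sorted log. This is only an illustration: headcount_by_hour is a hypothetical helper, it follows the answer's assumption that fromState = 'In' counts as +1, and it assumes nobody is on site at the start of the day.

```python
from datetime import datetime, timedelta


def headcount_by_hour(events, day):
    """Return [(hour_bucket, headcount)] for the 24 hour marks of `day`.

    events: iterable of (timestamp, from_state) tuples.
    day: datetime at midnight of the day of interest.
    The count at each bucket includes every event with timestamp <= bucket,
    matching the join condition HourBucket >= timeStamp in the SQL above.
    """
    ordered = sorted(events)
    counts = []
    running = 0
    index = 0
    for hour in range(24):
        bucket = day + timedelta(hours=hour)
        # Consume all events that happened up to and including this hour mark.
        while index < len(ordered) and ordered[index][0] <= bucket:
            running += 1 if ordered[index][1] == "In" else -1
            index += 1
        counts.append((bucket, running))
    return counts
```

As with the SQL version, each value is the number on site at the hour mark itself, not the peak or average during that hour.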

Multi-Series (Column) MySQL Query Won't Summarize Properly

I have several years worth of data in a table (inquiries). Every entry has a contact_time field that is the timestamp of their email contact. I'm trying to build monthly or weekly summary data for plotting on a multi-series graph. To that end, I need to see the month or week number in the first column with the respective data from 2014 in the second column, and from 2015 in the third column, etc.
SELECT MONTH(inquiries.contact_time) AS "Date",
(SELECT
COUNT(inquiries.id) AS "Inquiries"
FROM inquiries
WHERE YEAR(inquiries.contact_time) = "2014"
) AS "2014",
(SELECT COUNT(inquiries.id) AS "Inquiries"
FROM inquiries
WHERE YEAR(inquiries.contact_time) = "2015"
) AS "2015"
FROM inquiries
GROUP BY MONTH(inquiries.contact_time)
All I'm seeing is the total count for each year in all of the rows. Any help is appreciated.
Use conditional aggregation:
SELECT MONTH(i.contact_time),
SUM(YEAR(i.contact_time) = 2014) as cnt_2014,
SUM(YEAR(i.contact_time) = 2015) as cnt_2015
FROM inquiries i
WHERE YEAR(i.contact_time) >= 2014
GROUP BY MONTH(i.contact_time);
If you have an index on contact_time, then use the condition where i.contact_time >= '2014-01-01', so it can take advantage of the index.
You're seeing the total count for the year because your subqueries are not related to the month for the outer query's grouping.
I would write the query this way:
SELECT MONTH(contact_time) AS `Date`,
SUM(YEAR(contact_time)=2014) AS `2014`,
SUM(YEAR(contact_time)=2015) AS `2015`
FROM inquiries
GROUP BY MONTH(contact_time)
Explanation: the COUNT() of a specific set of rows is the same as the SUM() of 1's for those rows. And MySQL boolean expressions return the integer 1 for true.
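The conditional-aggregation idea can be tried without a MySQL server. The sketch below uses Python's sqlite3 module as a stand-in (an assumption: SQLite has no MONTH() or YEAR(), so strftime is used instead, but its comparisons also evaluate to 1 or 0). The helper name monthly_counts_by_year is hypothetical.

```python
import sqlite3


def monthly_counts_by_year(contact_times):
    """Return [(month, count_2014, count_2015)] for 'YYYY-MM-DD HH:MM:SS' strings."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE inquiries (id INTEGER PRIMARY KEY, contact_time TEXT)")
    conn.executemany(
        "INSERT INTO inquiries (contact_time) VALUES (?)",
        [(value,) for value in contact_times],
    )
    # Each comparison yields 1 or 0 per row, so SUM() counts the matching rows
    # within the month group instead of across the whole table.
    rows = conn.execute("""
        SELECT CAST(strftime('%m', contact_time) AS INTEGER) AS month,
               SUM(strftime('%Y', contact_time) = '2014') AS cnt_2014,
               SUM(strftime('%Y', contact_time) = '2015') AS cnt_2015
        FROM inquiries
        GROUP BY month
        ORDER BY month
    """).fetchall()
    conn.close()
    return rows
```

Because the year test lives inside the aggregate, it is evaluated per group, which is exactly what the uncorrelated subqueries in the question were missing.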

MySQL Week Function Unexpected Results

I am querying a database of hour entries and summing up by company and by week. I understand that MySQL's week function is based on a calendar week. That being said, I'm getting some unexpected grouping results. Perhaps you sharp-eyed folks can lend a hand:
SELECT * FROM (
SELECT
tms.date,
SUM( IF( tms.skf_group = "HP Group", tms.hours, 0000.00 )) as HPHours,
SUM( IF( tms.skf_group = "SKF Canada", tms.hours, 000.00 )) as SKFHours
FROM time_management_system tms
WHERE date >= "2012-01-01"
AND date <= "2012-05-11"
AND tms.skf_group IN ( "HP Group", "SKF Canada" )
GROUP BY WEEK( tms.date, 7 )
# ORDER BY tms.date DESC
# LIMIT 7
) AS T1
ORDER BY date ASC
My results are as follows: (Occasionally we don't have entries on a Sunday for example. Do null values matter?)
('date'=>'2012-01-01','HPHours'=>'0.00','SKFHours'=>'2.50'),
('date'=>'2012-01-02','HPHours'=>'97.00','SKFHours'=>'78.75'),
('date'=>'2012-01-09','HPHours'=>'86.50','SKFHours'=>'100.00'),
('date'=>'2012-01-16','HPHours'=>'68.00','SKFHours'=>'96.25'),
('date'=>'2012-01-24','HPHours'=>'39.00','SKFHours'=>'99.50'),
('date'=>'2012-02-05','HPHours'=>'3.00','SKFHours'=>'93.00'),
('date'=>'2012-02-06','HPHours'=>'12.00','SKFHours'=>'122.50'),
('date'=>'2012-02-13','HPHours'=>'64.75','SKFHours'=>'117.50'),
('date'=>'2012-02-21','HPHours'=>'64.50','SKFHours'=>'93.00'),
('date'=>'2012-03-02','HPHours'=>'45.50','SKFHours'=>'143.25'),
('date'=>'2012-03-05','HPHours'=>'62.00','SKFHours'=>'136.75'),
('date'=>'2012-03-12','HPHours'=>'54.25','SKFHours'=>'133.00'),
('date'=>'2012-03-19','HPHours'=>'77.75','SKFHours'=>'130.75'),
('date'=>'2012-03-26','HPHours'=>'61.00','SKFHours'=>'147.00'),
('date'=>'2012-04-02','HPHours'=>'86.75','SKFHours'=>'96.75'),
('date'=>'2012-04-09','HPHours'=>'84.25','SKFHours'=>'120.50'),
('date'=>'2012-04-16','HPHours'=>'90.00','SKFHours'=>'127.25'),
('date'=>'2012-04-23','HPHours'=>'103.25','SKFHours'=>'89.50'),
('date'=>'2012-05-02','HPHours'=>'72.50','SKFHours'=>'143.75'),
('date'=>'2012-05-07','HPHours'=>'68.25','SKFHours'=>'119.00')
January 2nd is the first Monday, hence Jan 1st is only one day. I would expect the output to be consecutive Mondays (Monday Jan 2, 9, 16, 23, 30, etc)? The unexpected week groupings below continue throughout the results. Any ideas?
Thanks very much!
It's not clear what selecting tms.date even means when you're grouping by some function on tms.date. My guess is that it means "the date value from any source row corresponding to this group". At that point, the output is entirely reasonable.
Given that any given group can have seven dates within it, what date do you want to get in the results?
EDIT: This behaviour is actually documented in "GROUP BY and HAVING with Hidden Columns":
MySQL extends the use of GROUP BY so that the select list can refer to nonaggregated columns not named in the GROUP BY clause.
...
The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate. Furthermore, the selection of values from each group cannot be influenced by adding an ORDER BY clause. Sorting of the result set occurs after values have been chosen, and ORDER BY does not affect which values the server chooses.
The tms.date column isn't part of the GROUP BY clause - only a function operating on tms.date is part of the GROUP BY clause, so I believe the text above applies to the way that you're selecting tms.date: you're getting any date within that week.
If you want the earliest date, you might try
SELECT MIN(tms.date), ...
That's assuming that MIN works with date/time fields, of course. I can't easily tell from the documentation.
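Here is a small sketch of the MIN(date) suggestion, run against SQLite through Python's sqlite3 module instead of MySQL. The assumptions are spelled out: strftime('%W', ...) replaces WEEK(); it numbers weeks from Monday within a single year, so the sample stays inside 2012; and the table is reduced to two hypothetical columns.

```python
import sqlite3


def weekly_hours(entries):
    """Return [(earliest_entry_date, total_hours)] per Monday-based week.

    entries: iterable of ('YYYY-MM-DD', hours) tuples, all within one year.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tms (date TEXT, hours REAL)")
    conn.executemany("INSERT INTO tms VALUES (?, ?)", entries)
    # MIN(date) makes the reported date deterministic: the earliest entry in each week.
    rows = conn.execute("""
        SELECT MIN(date) AS first_entry, SUM(hours) AS total_hours
        FROM tms
        GROUP BY strftime('%W', date)
        ORDER BY first_entry
    """).fetchall()
    conn.close()
    return rows
```

Note that MIN(date) returns the earliest date that actually has an entry, so a week with nothing logged on its Monday still shows a later date; getting the Monday itself would mean computing it from the date rather than selecting it.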
The question is not clear to me, but I guess you don't want to group by week, because WEEK() gives the week of the year (today falls in the 19th week).
I think you want to group by weekday, like GROUP BY WEEKDAY(tms.date)

Aggregating data by timespan in MySQL

Basically, what I want is to aggregate some values in a table according to a timespan.
What I do is, I take snapshots of a system every 15 minutes and I want to be able to draw some graph over a long period. Since the graphs get really confusing if too many points are shown (besides getting really slow to render) I want to reduce the number of points by aggregating multiple points into a single point by averaging over them.
For this I'd have to be able to group by buckets that can be defined by me (daily, weekly, monthly, yearly, ...) but so far all my experiments had no luck at all.
Is there some trick I can apply to do so?
I had a similar question: collating-stats-into-time-chunks and had it answered very well. In essence, the answer was:
Perhaps you can use the DATE_FORMAT() function, and grouping. Here's an example, hopefully you can adapt to your precise needs.
SELECT
DATE_FORMAT( time, "%H:%i" ),
SUM( bytesIn ),
SUM( bytesOut )
FROM
stats
WHERE
time BETWEEN <start> AND <end>
GROUP BY
DATE_FORMAT( time, "%H:%i" )
If your time window covers more than one day and you use the example format, data from different days will be aggregated into 'hour-of-day' buckets. If the raw data doesn't fall exactly on the hour, you can smooth it out by using "%H:00."
Thanks be to martin clayton for the answer he provided me.
It's easy to truncate times to the last 15 minutes (for example), by doing something like:
SELECT dateadd(minute, datediff(minute, '20000101', yourDateTimeField) / 15 * 15, '20000101') AS the15minuteBlock, COUNT(*) as Cnt
FROM yourTable
GROUP BY dateadd(minute, datediff(minute, '20000101', yourDateTimeField) / 15 * 15, '20000101');
Use similar truncation methods to group by hour, week, whatever.
You could always wrap it up in a CASE statement to handle multiple methods, using:
GROUP BY CASE @option WHEN 'week' THEN dateadd(week, .....
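The truncation arithmetic above (whole minutes since a fixed origin, integer-divided by the block size and multiplied back) is easy to check outside the database. A minimal Python sketch, with hypothetical helper names and the same arbitrary 2000-01-01 origin:

```python
from collections import Counter
from datetime import datetime, timedelta

EPOCH = datetime(2000, 1, 1)  # arbitrary fixed origin, as in the SQL above


def truncate_to_block(ts, minutes=15):
    """Round `ts` down to the start of its `minutes`-wide block."""
    elapsed_minutes = int((ts - EPOCH).total_seconds() // 60)
    # Dropping the remainder is the same as (elapsed / minutes) * minutes
    # with integer division.
    return EPOCH + timedelta(minutes=elapsed_minutes - elapsed_minutes % minutes)


def count_per_block(timestamps, minutes=15):
    """Count how many timestamps fall into each block (the GROUP BY / COUNT(*) part)."""
    return Counter(truncate_to_block(ts, minutes) for ts in timestamps)
```

Changing the minutes argument to 60 or 1440 gives hourly or daily buckets with the same arithmetic.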
As an addition to @cmroanirgo, I didn't need sums of data but averages (to see the average FPS / player count of my game servers). I also need to view the data either in detail per 5 minutes or as an entire week (data gets stored every minute).
As an example, you can use the SQL function AVG instead of SUM to get an average. You also have to give each selected value an alias, and it shouldn't be the actual field name (that will conflict later on in your query). Here's the query I'm using to aggregate averages over 1 week, by the hour:
SELECT
DATE_FORMAT( moment, "%Y-%m-%d %H:00" ) as _moment,
AVG( maxplayers ) as _maxplayers,
AVG( players ) as _players,
AVG( servers ) as _servers,
AVG( avarage_fps ) as _avarage_fps,
AVG( avarage_realfps ) as _avarage_realfps,
AVG( avarage_maxfps ) as _avarage_maxfps
FROM
playercount
WHERE
moment BETWEEN "<date minus 1 week>" AND "<now>"
GROUP BY
_moment
ORDER BY _moment ASC
This is then used (together with PHP) to use in a Bootstrap graph;
<?php
//Do the query here
foreach ($result->fetch_all(MYSQLI_ASSOC) as $item) {
$labels[] = $item['_moment'];
$maxplayers[] = $item['_maxplayers'];
$players[] = $item['_players'];
$servers[] = $item['_servers'];
$fps[] = $item['_avarage_fps'];
$fpsreal[] = $item['_avarage_realfps']/10;
$fpsmax[] = $item['_avarage_maxfps'];
}
?>
var playerChartId = document.getElementById("playerChartId");
var playerChart = new Chart(playerChartId, {
type: 'line',
data: {
labels: ["<?= implode('","', $labels); ?>"],
datasets: [
{
data: [<?= implode(',', $servers); ?>],
borderColor: '#007bff',
pointRadius: 0
},
//etc...