I have a number of stores where I would like to sum the energy consumption so far this year compared with the same period last year. My challenge is that in the current year the stores have different date intervals in terms of delivered data. That means that store A may have data between 01.01.2018 and 20.01.2018, and store B may have data between 01.01.2018 and 28.01.2018. I would like to sum the same date intervals current year versus previous year.
The data looks like this:
Store  Date        Sum
A      01.01.2018  12
A      20.01.2018  11
B      01.01.2018  33
B      28.01.2018  32
There are millions of rows, however, and I would like to use these dates as references to get the corresponding sums for the previous year.
This is my (erroneous) try:
SET @curryear = (SELECT YEAR(MAX(start_date)) FROM energy_data);
SET @maxdate_curryear = (SELECT MAX(start_date) FROM energy_data WHERE
YEAR(start_date) = @curryear);
SET @mindate_curryear = (SELECT MIN(start_date) FROM energy_data WHERE
YEAR(start_date) = @curryear);
-- the same date intervals last year
SET @maxdate_prevyear = (@maxdate_curryear - INTERVAL 1 YEAR);
SET @mindate_prevyear = (@mindate_curryear - INTERVAL 1 YEAR);
-- sums current year
CREATE TABLE t_sum_curr AS
SELECT name as name_curr, sum(kwh) as sum_curr, min(start_date) AS
min_date_curr, max(start_date) AS max_date_curr, count(distinct
start_date) AS ant_timer FROM energy_data WHERE agg_type = 'timesnivå'
AND start_date >= @mindate_curryear and start_date <= @maxdate_curryear GROUP BY NAME;
-- also seems fair, the same dates one year ago, figured I should find those first and in the next query use that to sum each stores between those date intervals
CREATE TABLE t_sum_prev AS
SELECT name_curr as name_curr2, (min_date_curr - INTERVAL 1 YEAR) AS
min_date_prev, (max_date_curr - INTERVAL 1 YEAR) as max_date_prev FROM
t_sum_curr;
-- getting into trouble!
CREATE TABLE the_results AS
SELECT name, start_date, sum(kwh) as sum_prev from energy_data where
agg_type = 'timesnivå' and
start_date >= @mindate_prevyear and start_date <=
@maxdate_prevyear group by name having start_date BETWEEN (SELECT
min_date_prev from t_sum_prev) AND
(SELECT max_date_prev from t_sum_prev);
This last query fails with an error message saying that my subquery returns more than one row.
I assume what you have is a list of energy consumption figures, where bills or readings have been taken at irregular times, so the consumption covers irregular periods.
The basic approach you need to take is to regularise the consumption periods: establish which days each period covers, break each reading down into as many days as it covers, and give each day the daily average consumption of its period.
I'm assuming the consumption periods are entirely sequential (as a bill or reading normally would be), and not overlapping.
Because of the volume of rows involved (you say millions even in its current form), you might not want to leave the data in daily form - it might suffice to regroup them into regular weekly, monthly, or quarterly periods, depending on what level of granularity you require for comparison.
Once you have your regular periods, comparison will be a piece of cake.
If this is part of a report that will be run on an ongoing basis, you'd probably want to implement some logic that calculates a "regularised consumption" incrementally and on a scheduled basis and stores it in a summary table, with appropriate columns and indexes, so that you aren't having to process many millions of historical rows each time the report is run.
Trying to work around the irregular periods (if indeed it can be done) with fancy joins and on-the-fly averages, rather than tackling them head on, will likely lead to very difficult logic, and particularly on a data set of this size, dire performance.
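To make the "regularise into days" idea concrete, here is a minimal sketch in Python rather than SQL. It assumes each reading is a tuple of (first day covered, last day covered, kWh) with an inclusive date range; the function name spread_to_daily and the sample figures are made up for illustration and are not from the question.

```python
from collections import defaultdict
from datetime import date, timedelta

def spread_to_daily(readings):
    """Spread each (first_day, last_day, kwh) reading evenly over the days it covers."""
    daily = defaultdict(float)
    for first_day, last_day, kwh in readings:
        n_days = (last_day - first_day).days + 1  # inclusive range (assumption)
        per_day = kwh / n_days                    # daily average for this period
        for offset in range(n_days):
            daily[first_day + timedelta(days=offset)] += per_day
    return dict(daily)

# Hypothetical sample data: two sequential, non-overlapping periods.
readings = [
    (date(2018, 1, 1), date(2018, 1, 4), 40.0),  # 4 days, 10 kWh per day
    (date(2018, 1, 5), date(2018, 1, 6), 30.0),  # 2 days, 15 kWh per day
]
daily = spread_to_daily(readings)
```

Once the data is in this daily form, it can be regrouped into weeks or months by summing the daily values.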
EDIT: from the comments below.
@Alexander, I've knocked together an example of a query. I haven't tested it and I've written it all in a text editor, so excuse any small syntax errors. What I've come up with seems a bit complex (more complex than I imagined when I began), but I'm also a little bit tired, so I'm not sure whether it could be simplified further.
The only point I would make is that the performance of this query (or any such query), because of the nature of what it has to do in traversing date ranges, is likely to be appalling on a table with millions of rows. I stand by my earlier remarks that proper indexing of the source data will be crucial, and that summarising the source data into a larger granularity will massively aid performance (at the expense of a one-off hit to summarise it). Even daily granularity will reduce the number of rows by a factor of 24!
WITH energy_data_ext AS
(
SELECT
ed.name AS store_name
,YEAR(ed.start_date) AS reading_year
,ed.start_date AS reading_date
,ed.kwh AS reading_kwh
FROM
energy_data AS ed
)
,available_stores AS
(
SELECT ede.store_name
FROM energy_data_ext AS ede
GROUP BY ede.store_name
)
,current_reading_yr_per_store AS
(
SELECT
ede.store_name
,MAX(ede.reading_year) AS current_reading_year
FROM
energy_data_ext AS ede
GROUP BY
ede.store_name
)
,latest_reading_ranges_per_year AS
(
SELECT
ede.store_name
,ede.reading_year
,MAX(ede.reading_date) AS latest_reading_date_of_yr
FROM
energy_data_ext AS ede
GROUP BY
ede.store_name
,ede.reading_year
)
,store_reading_ranges AS
(
SELECT
avs.store_name
,lryps.current_reading_year
,lyrr.latest_reading_date_of_yr AS current_year_latest_reading_date
,(lryps.current_reading_year - 1) AS prev_reading_year
,(lyrr.latest_reading_date_of_yr - INTERVAL 1 YEAR) AS prev_year_latest_reading_date
FROM
available_stores AS avs
LEFT JOIN
current_reading_yr_per_store AS lryps
ON (lryps.store_name = avs.store_name)
LEFT JOIN
latest_reading_ranges_per_year AS lyrr
ON (lyrr.store_name = avs.store_name)
AND (lyrr.reading_year = lryps.current_reading_year)
)
-- at this stage, we should have all the calculations we need to
-- establish the range for the latest year, and the range for the year prior to that
,current_year_consumption AS
(
SELECT
avs.store_name
,SUM(cyed.reading_kwh) AS latest_year_kwh
FROM
available_stores AS avs
LEFT JOIN
store_reading_ranges AS srs
ON (srs.store_name = avs.store_name)
LEFT JOIN
energy_data_ext AS cyed
ON (cyed.store_name = srs.store_name)
AND (cyed.reading_year = srs.current_reading_year)
AND (cyed.reading_date <= srs.current_year_latest_reading_date)
GROUP BY
avs.store_name
)
,prev_year_consumption AS
(
SELECT
avs.store_name
,SUM(pyed.reading_kwh) AS prev_year_kwh
FROM
available_stores AS avs
LEFT JOIN
store_reading_ranges AS srs
ON (srs.store_name = avs.store_name)
LEFT JOIN
energy_data_ext AS pyed
ON (pyed.store_name = srs.store_name)
AND (pyed.reading_year = srs.prev_reading_year)
AND (pyed.reading_date <= srs.prev_year_latest_reading_date)
GROUP BY
avs.store_name
)
SELECT
avs.store_name
,srs.current_reading_year
,srs.current_year_latest_reading_date
,lyc.latest_year_kwh
,srs.prev_reading_year
,srs.prev_year_latest_reading_date
,pyc.prev_year_kwh
FROM
available_stores AS avs
LEFT JOIN
store_reading_ranges AS srs
ON (srs.store_name = avs.store_name)
LEFT JOIN
current_year_consumption AS lyc
ON (lyc.store_name = avs.store_name)
LEFT JOIN
prev_year_consumption AS pyc
ON (pyc.store_name = avs.store_name)
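As a cross-check on what the query is meant to compute, the same per-store logic can be sketched in Python. This is only an illustration on a small in-memory list, not something to run against millions of rows. The function name same_period_sums and the sample rows are assumptions, and the previous-year cutoff is compared by (month, day) so that a 29 February cutoff does not need special handling.

```python
from datetime import date

def same_period_sums(rows):
    """rows: list of (store, day, kwh).
    Returns {store: (latest_day, current_year_sum, previous_year_sum)}, where the
    previous-year sum only covers days up to the same month/day as latest_day."""
    latest = {}
    for store, day, _ in rows:
        if store not in latest or day > latest[store]:
            latest[store] = day
    sums = {store: [0.0, 0.0] for store in latest}
    for store, day, kwh in rows:
        cutoff = latest[store]
        if day.year == cutoff.year:
            sums[store][0] += kwh  # current year, always on or before the cutoff
        elif (day.year == cutoff.year - 1
              and (day.month, day.day) <= (cutoff.month, cutoff.day)):
            sums[store][1] += kwh  # previous year, same period only
    return {s: (latest[s], cur, prev) for s, (cur, prev) in sums.items()}

# Hypothetical sample data, loosely based on the stores in the question.
rows = [
    ("A", date(2017, 1, 1), 10.0),
    ("A", date(2017, 1, 20), 9.0),
    ("A", date(2017, 1, 25), 100.0),  # after store A's cutoff, so excluded
    ("A", date(2018, 1, 1), 12.0),
    ("A", date(2018, 1, 20), 11.0),
    ("B", date(2017, 1, 1), 30.0),
    ("B", date(2017, 1, 28), 31.0),
    ("B", date(2018, 1, 1), 33.0),
    ("B", date(2018, 1, 28), 32.0),
]
result = same_period_sums(rows)
```

Each store gets its own cutoff date, which is the point the original attempt missed by using one global date range for all stores.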
Related
I have the following two tables:
movie_sales (provided daily)
movie_id
date
revenue
movie_rank (provided every few days or weeks)
movie_id
date
rank
The tricky thing is that every day I have data for sales, but only data for ranks once every few days. Here is an example of sample data:
`movie_sales`
- titanic (ID), 2014-06-01 (date), 4.99 (revenue)
- titanic (ID), 2014-06-02 (date), 5.99 (revenue)
`movie_rank`
- titanic (ID), 2014-05-14 (date), 905 (rank)
- titanic (ID), 2014-07-01 (date), 927 (rank)
And, because the movie_rank.date of 2014-05-14 is closer to the two sales dates, the output should be:
id date revenue closest_rank
titanic 2014-06-01 4.99 905
titanic 2014-06-02 5.99 905
The following query works to get the results by getting the min date difference in the sub-select:
SELECT
id,
date,
revenue,
(SELECT rank from movie_rank where id=s.id ORDER BY ABS(DATEDIFF(date, s.date)) ASC LIMIT 1)
FROM
movie_sales s
But I'm afraid that this would have terrible performance, as it will literally be doing millions of subselects on millions of rows. What would be a better way to do this, or is there really no proper way, since an index cannot be used effectively with DATEDIFF?
Unfortunately, you are right. The movie rank table must be searched for each movie sale, and of all matching rank rows the closest one must be picked.
With an index on movie_rank(id) the DBMS finds the movie rows quickly, but an index on movie_rank(id, date) would be better, because the date could be read from the index and only the one best match would be read from the table.
But you also say that there are new ranks every few days. If it is guaranteed that a rank can be found within a certain range, e.g. for each date there will be at least one rank in the twenty days before and at least one rank in the twenty days after, you can limit the search accordingly. (The index on movie_rank(id, date) would be essential for this, though.)
SELECT
id,
date,
revenue,
(
select r.rank
from movie_rank r
where r.id = s.id
and r.date between s.date - interval 20 day
and s.date + interval 20 day
order by abs(datediff(r.date, s.date)) asc
limit 1
)
FROM movie_sales s;
This is difficult to get quick with SQL. In a programming language I would choose this algorithm:
1. Sort the two tables by date and point to the first rows.
2. Move the rank pointer forward until we match the sales date or are beyond it. (If we aren't there already.)
3. Compare the sales date with the rank date we are pointing at and with the rank date of the previous row. Take the closer one.
4. Move the sales pointer one row forward.
5. Go to 2.
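The steps above can be sketched in Python for a single movie. This is a minimal illustration, assuming both lists are already sorted by date and that there is at least one rank row; the function name closest_ranks and the tie-breaking rule (the earlier rank wins) are my own choices, not part of the question.

```python
from datetime import date

def closest_ranks(sales, ranks):
    """sales: list of (day, revenue) sorted by day.
    ranks: non-empty list of (day, rank) sorted by day.
    Returns one (day, revenue, closest_rank) tuple per sale."""
    result = []
    r = 0  # rank pointer, only ever moves forward
    for sale_day, revenue in sales:
        # Move the rank pointer until it is on or beyond the sale date.
        while r < len(ranks) - 1 and ranks[r][0] < sale_day:
            r += 1
        best = ranks[r]
        if r > 0:
            prev = ranks[r - 1]
            # Compare the current rank row with the previous one, take the closer.
            if abs((sale_day - prev[0]).days) <= abs((best[0] - sale_day).days):
                best = prev
        result.append((sale_day, revenue, best[1]))
    return result

# Sample data from the question.
sales = [(date(2014, 6, 1), 4.99), (date(2014, 6, 2), 5.99)]
ranks = [(date(2014, 5, 14), 905), (date(2014, 7, 1), 927)]
matched = closest_ranks(sales, ranks)
```

Because neither pointer ever moves backwards, the work per movie is linear in the number of sales and rank rows once they are sorted.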
With this algorithm the rank pointer is always already close to the position we need. Let's see if we can do the same with SQL. Iterations are done with recursive queries in SQL. These are available in MySQL as of version 8.0.
We start with sorting the rows, i.e. giving them numbers. Then we iterate through both data sets.
with recursive
sales as
(
select *, row_number() over (partition by movie_id order by date) as rn
from movie_sales
),
ranks as
(
select *, row_number() over (partition by movie_id order by date) as rn
from movie_rank
),
cte (movie_id, revenue, srn, rrn, sdate, rdate, rrank, closest_rank) as
(
select
movie_id, s.revenue, s.rn, r.rn, s.date, r.date, r.ranking,
case when s.date <= r.date then r.ranking end
from (select * from sales where rn = 1) s
join (select * from ranks where rn = 1) r using (movie_id)
union all
select
cte.movie_id,
cte.revenue,
coalesce(s.rn, cte.srn),
coalesce(r.rn, cte.rrn),
coalesce(s.date, cte.sdate),
coalesce(r.date, cte.rdate),
coalesce(r.ranking, cte.rrank),
case when coalesce(r.date, cte.rdate) >= coalesce(s.date, cte.sdate) then
case when abs(datediff(coalesce(r.date, cte.rdate), coalesce(s.date, cte.sdate))) <
abs(datediff(cte.rdate, coalesce(s.date, cte.sdate)))
then coalesce(r.ranking, cte.rrank)
else cte.rrank
end
end
from cte
left join sales s on s.movie_id = cte.movie_id and s.rn = cte.srn + 1 and cte.closest_rank is not null
left join ranks r on r.movie_id = cte.movie_id and r.rn = cte.rrn + 1 and cte.rdate < cte.sdate
where s.movie_id is not null or r.movie_id is not null
-- where cte.closest_rank is null
)
select
movie_id,
sdate,
revenue,
closest_rank
from cte
where closest_rank is not null;
(BTW: I named the column ranking, because rank is a reserved word in SQL.)
Demo: https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=e994cb56798efabc8f7249fd8320e1cf
This is probably still slow. The reason is that there are no pointers to a row in SQL. If we want to go from row #1 to row #2, we must search for that row, while in a programming language we would really just move the pointer one step forward. If the tables had an ID, we could build a chain (next_row_id) instead of using row numbers. That could speed this process up. But well, I guess you already notice: this is not an algorithm made for SQL.
Another approach... Avoid the problem by cleansing the data.
Make sure the rank is available for every day. When a new date comes in, find the previous rank, then fill in all the rows for the intervening days.
(This will take some initial effort to 'fix' all the previous missing dates. After that, it is a small effort when a new list of ranks comes in.)
The "report" would be a simple JOIN on the date. You would probably need a 2-column INDEX(movie_id, date) or something like that.
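As a rough illustration of the "fill in the intervening days" step, here is a small Python sketch for one movie. It assumes the rank rows are sorted by date; fill_daily_ranks and its last_day parameter are hypothetical names. It carries the previous rank forward, so it does not pick the rank closest in time.

```python
from datetime import date, timedelta

def fill_daily_ranks(ranks, last_day):
    """ranks: list of (day, rank) sorted by day.
    Returns {day: rank} for every day from the first rank date through last_day,
    carrying the most recent known rank forward over the gaps."""
    one_day = timedelta(days=1)
    filled = {}
    for i, (day, rank) in enumerate(ranks):
        # This rank applies until the next rank date (or until last_day).
        next_day = ranks[i + 1][0] if i + 1 < len(ranks) else last_day + one_day
        current = day
        while current < next_day:
            filled[current] = rank
            current += one_day
    return filled

# Rank rows from the question, filled through a hypothetical end date.
ranks = [(date(2014, 5, 14), 905), (date(2014, 7, 1), 927)]
filled = fill_daily_ranks(ranks, date(2014, 7, 3))
```

With one row per day, looking up the rank for a sale becomes a plain equality match on (movie, date).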
The ultimate solution would be not to calculate all the ranks every time, but to store them (in a new column, or even in a new table if you don't want to change existing tables).
Each time you update, you could look for sales data without a rank and calculate it only for those rows.
With the above approach you always get the rank from the last available ranking BEFORE the sales date (e.g. if you have data from 14 days before and 1 day after, the one before would still be used).
If you strictly need the ranking closest in time, then you also need to run the UPDATE for newly arrived ranking info. I believe it would still be more efficient in the long run.
I've looked at other answers to this question but haven't found a solution.
I have two tables with a tracking number, one has status history and several records per tracking with different date times for each status. The other table is a cost table that has one record per tracking with a date time that is in the same general time period of the status table but never exact.
I cannot join just on the tracking number itself due to the duplication of the tracking number in the data from months prior. Ex. a tracking number may appear in March of 2019 and again in January of 2020 even though they are very different parcels being shipped. However if you concatenate the tracking with the orderid on the status table you do get a unique value. That orderid number though is not in the cost table so you cannot join the two tables on that value either. It has to be tracking and a date range of some sort.
So I am looking to join the two tables using the tracking number and a date range of +- 30 days from the date provided on the cost table and the final date for that tracking number on the status table.
So something like this without the "is in a 30 day window" part clearly.
SELECT C.cost
, S.trackingnumber
From UPSCost C
join UPSStatus S
ON C.trackingnumber = S.trackingnumber
WHERE MAX(S.date_time) is in a 30 day window of C.event_date_time
You could expand your join and add the date condition to it. Something like this.
SELECT
C.cost,
S.trackingnumber
From UPSCost C
join UPSStatus S
ON(
-- Same tracking number
C.trackingnumber = S.trackingnumber AND
-- status updated within -+30 days from the date found in cost table
s.date_time between DATE_SUB(C.event_date_time, interval 30 day) AND DATE_ADD(C.event_date_time, interval 30 day)
)
Order by S.date_time desc -- latest status first?
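If the comparison really has to be against the final status date of each shipment, as the question describes, the logic could look like the following Python sketch. The tuple layouts, the function name match_costs and the sample tracking numbers are assumptions for illustration only.

```python
from datetime import datetime, timedelta

def match_costs(costs, statuses, window_days=30):
    """costs: list of (tracking, event_dt, cost).
    statuses: list of (tracking, order_id, status_dt).
    Returns (tracking, order_id, cost) for every cost row whose event date lies
    within window_days of the final status date of a shipment with that tracking number."""
    # Final (latest) status date per shipment; tracking + order_id is unique.
    final = {}
    for tracking, order_id, status_dt in statuses:
        key = (tracking, order_id)
        if key not in final or status_dt > final[key]:
            final[key] = status_dt
    window = timedelta(days=window_days)
    matches = []
    for tracking, event_dt, cost in costs:
        for (s_tracking, order_id), last_dt in final.items():
            if s_tracking == tracking and abs(last_dt - event_dt) <= window:
                matches.append((tracking, order_id, cost))
    return matches

# Hypothetical data: the same tracking number reused ten months apart.
statuses = [
    ("1Z", 1, datetime(2019, 3, 1)),
    ("1Z", 1, datetime(2019, 3, 5)),
    ("1Z", 2, datetime(2020, 1, 10)),
    ("1Z", 2, datetime(2020, 1, 12)),
]
costs = [
    ("1Z", datetime(2019, 3, 10), 8.5),
    ("1Z", datetime(2020, 1, 20), 9.0),
]
matches = match_costs(costs, statuses)
```

In SQL the equivalent would be to aggregate the status table to one row per shipment first and then join that result on the tracking number and the date window.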
We have a business that rents out international phone numbers to customers when they travel. When a customer places an order, we want to show the customer the phone numbers that are available for the booking dates, based on the order's start_date and end_date, and that are not already occupied.
Since these phone numbers are rented out, I need to select from the table ONLY those numbers that are not rented out yet for dates that would interfere with the current customers dates.
I also don't want to rent out any phone number prior to 7 days after its end date. Meaning, If a customer booked a phone number for 1-1-2020 through 1-20-2020, I don't want this phone number to be booked by another customer before 1-27-2020. I want the phone number to have a 7 day window of being clear.
I have a table with the phone numbers and a table with the orders that is related to the phone numbers table via phone_number_id. The orders table has the current customer's start_date and end_date for travel, without a phone number id saved to it yet. The orders table also has the start_date and end_date of all other customers' travel dates, as well as which phone_number_id was assigned/booked for those dates.
What would the MySQL query look like to select the phone numbers that are available for the current customer's dates?
This is the query I have built so far:
SELECT x.id
, x.area_code
, x.phone_number
, y.start_date
, y.end_date
FROM vir_num_table x
LEFT
JOIN orderitemsdetail_table y
ON y.vn_id = x.id
WHERE y.start_date BETWEEN '2020-01-11' AND '2020-01-18'
OR y.start_date IS NULL
I've built this query, but I'm stuck on how to add the end_date logic.
Any help would be appreciated! Thanks in advance.
The way I'd approach the problem, conceptually, is as a cross product of the set of all phone numbers with the reservation timeframe, and then to exclude those numbers where there's a conflicting reservation.
A conflict would be an overlap: an existing reservation that has a start_date before the end of the proposed reservation AND has an end_date on or after the start of the proposed reservation.
I'd do an anti-join pattern, something like this:
SELECT pn.phone_number
FROM phone_number pn
LEFT
JOIN reservation rs
ON rs.phone_number = pn.phone_number
AND rs.start_dt <= '2019-12-27' + INTERVAL +7 DAY
AND rs.end_dt > '2019-12-20' + INTERVAL -7 DAY
WHERE rs.phone_number IS NULL
That essentially says get all rows from phone number, along with matching rows from reservations (rows that overlap), but then exclude all the rows that had a match, leaving just phone_number rows that did not have a match.
We can change a < test to <=, or subtract 8 days instead of 7, to tailor the "7 day" window; we can tweak this as we run the query through the test cases.
We can achieve an equivalent result using NOT EXISTS and a correlated subquery. Some people find this easier to comprehend than the anti-join, but it's essentially the same query doing the same thing: get all rows from phone_number, but exclude the rows where there is a matching (overlapping) row in reservation.
SELECT pn.phone_number
FROM phone_number pn
WHERE NOT EXISTS
( SELECT 1
FROM reservation rs
WHERE rs.phone_number = pn.phone_number
AND rs.start_dt <= '2019-12-27' + INTERVAL +7 DAY
AND rs.end_dt > '2019-12-20' + INTERVAL -7 DAY
)
There are several questions on StackOverflow about checking for overlap, or no overlap, of date ranges.
See e.g.
How to check if two date ranges overlap in mysql?
PHP/SQL - How can I check if a date user input is in between an existing date range in my database?
MySQL query to select distinct rows based on date range overlapping
EDIT
Based on the SQL added as an edit to the question, I'd do the query like this:
SELECT pn.`id`
, pn.`area_code`
, pn.`phone_number`
FROM `vir_num_table` pn
LEFT
JOIN `orderitemsdetail_table` rs
ON rs.vn_id = pn.id
AND rs.start_date <= '2020-01-18' + INTERVAL +7 DAY
AND rs.end_date > '2020-01-11' + INTERVAL -7 DAY
WHERE rs.vn_id IS NULL
There are two "tricky" parts. The first is the anti-join and understanding how it works: an outer join to return all rows from vir_num_table, but excluding any rows that have a matching row in the reservations table. The second tricky part is checking for the overlap, coming up with the conditions r.start <= p.end AND r.end >= p.start, then deciding whether we want to count equality as an overlap, and tweaking the extra seven days (easiest to me is to just subtract the 7 days from the beginning of the proposed reservation).
... it now occurs to me that we need to add a guard period of 7 days on the end of the reservation period as well, doh!
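The overlap test with a guard period on both sides is easy to get wrong at the boundaries, so here is a small Python sketch to experiment with. It assumes that a gap of exactly 7 days is acceptable (a number whose booking ends on 2020-01-20 can be booked again from 2020-01-27), which is why both comparisons are strict; the function names and phone numbers are made up.

```python
from datetime import date, timedelta

GUARD = timedelta(days=7)  # assumed guard period on both sides

def conflicts(existing_start, existing_end, new_start, new_end, guard=GUARD):
    """True if an existing reservation overlaps the new one widened by the guard."""
    return existing_start < new_end + guard and existing_end > new_start - guard

def available_numbers(numbers, reservations, new_start, new_end):
    """reservations: list of (phone_number, start, end). Anti-join in plain Python."""
    busy = {num for num, start, end in reservations
            if conflicts(start, end, new_start, new_end)}
    return [num for num in numbers if num not in busy]

# Hypothetical data based on the example dates in the question.
numbers = ["555-0001", "555-0002"]
reservations = [("555-0001", date(2020, 1, 1), date(2020, 1, 20))]

too_early = available_numbers(numbers, reservations, date(2020, 1, 26), date(2020, 1, 30))
just_ok = available_numbers(numbers, reservations, date(2020, 1, 27), date(2020, 2, 3))
```

Changing either strict comparison to a non-strict one moves the boundary by one day, which is the same tweak as switching between < and <= in the SQL conditions.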
Here's a query plus sorting algo to choose the optimal phone number selection for maximum utilization efficiency (i.e. getting as close as possible to exactly 7 days before and after each use).
I set it to give open ends a weight of 9, so that "near perfect" fits (7-8 days before or after) would be selected ahead of open-ended numbers. This will yield a slight efficiency improvement, as open numbers can accommodate any reservation. You can adjust this for your needs. If you set this to 0, for example, it would always select open numbers first.
SELECT ph.phone_number,
COALESCE(
MIN(
IF(res.start_date > '2020-01-18',
NULL, -- ignore before-comparison for reservations starting and ending after date range
DATEDIFF('2020-01-11', res.end_date)
)), 9) AS open_days_before,
COALESCE(
MIN(
IF(res.end_date < '2020-01-11',
NULL, -- ignore after-comparison for reservations starting and ending before date range
DATEDIFF(res.start_date, '2020-01-18')
)), 9) AS open_days_after
FROM phone_number ph
LEFT JOIN reservation res
ON res.phone_number = ph.phone_number
AND res.end_date >= CURRENT_DATE() - INTERVAL 6 DAY
GROUP BY ph.phone_number
HAVING open_days_before >= 7
AND open_days_after >= 7
ORDER BY open_days_before + open_days_after
LIMIT 1
Edit: updated to add grouping, because I realize this is an aggregate problem.
Edit 2: bug fix, changed MAX to MIN
Edit 3: added res.end_date >= CURRENT_DATE - INTERVAL 6 DAY to ignore past reservations, limiting aggregate data and treating phone number with no reservations between 6 days ago and the beginning of the new order as "open on the front-end"
Edit 4: added IF conditions to eliminate reservations outside the given before-or-after comparison ranges (e.g. comparing reservations after the selected range from influencing the "open days before" number), to prevent negative numbers, except when there's overlap with the selected range.
Based on the info you've added, you shouldn't need to check the start date of phone numbers which have been booked out.
Your customer provides you with a start date and an end date.
You only rent out phone numbers 7 days after their last lease ended.
All you need to do is fetch phone numbers which either:
- are not rented out and therefore aren't in the orderitems table
- OR have an end_date which is at least 7 days before the new customer's start date.
Here you go:
SELECT
`main_table`.`id`,
`main_table`.`area_code`,
`main_table`.`phone_number`,
`orderitemsdetail_table`.`start_date`,
`orderitemsdetail_table`.`end_date`
FROM
`vir_num_table` AS `main_table`
LEFT JOIN
`orderitemsdetail_table` AS `orderitemsdetail_table` ON main_table.id = orderitemsdetail_table.vn_id
WHERE
(DATE_ADD(orderitemsdetail_table.end_date, INTERVAL 7 DAY) < '<CUSTOMER START DATE>'
OR orderitemsdetail_table.start_date > '<CUSTOMER END DATE>')
OR orderitemsdetail_table.id IS NULL
I have kind of an interesting situation that I will try my best to explain.
I have a table called appointments, which holds the many appointments that a salesperson can have with a potential customer. The relationship between appointments and salespeople is many-to-one, and the same is true for potential customers.
I need to count how many appointments a salesperson has set with a lead when that salesperson has never set an appointment with that lead before.
Here is how far I have gotten in the code (I'm trying to see how many appointments a salesperson set yesterday, hence the date scrub):
SELECT COUNT(DISTINCT lead)
FROM appointments
WHERE status = 3
and DATE(appointment_created_at) = CURDATE() - interval 1 day
AND creator = 'xxx';
(the column creator represents the individual sales person and the column lead represents the individual potential customer)
The problem with this SQL query is that if a salesperson is resetting an appointment with a lead they have already set an appointment with, it still counts it as a "set appointment".
How can I count the number of rows in my appointments table without counting leads who have already been set before?
You can utilize NOT EXISTS() to check if an appointment already exists earlier or not.
SELECT COUNT(DISTINCT a1.lead)
FROM appointments a1
WHERE a1.status = 3
and a1.appointment_created_at >= CURRENT_DATE() - INTERVAL 1 DAY
AND a1.appointment_created_at < CURRENT_DATE()
AND a1.creator = 'xxx'
AND NOT EXISTS (SELECT 1
FROM appointments a2
WHERE a2.creator = 'xxx'
AND a2.lead = a1.lead
AND a2.appointment_created_at < a1.appointment_created_at)
For good performance of the correlated subquery in the NOT EXISTS() portion, you can use the following composite index: (creator, lead, appointment_created_at)
And for the main SELECT query, you can add the following composite index: (creator, status, appointment_created_at)
If you want the number of "first-time" appointments, you can use row_number() or a correlated subquery:
SELECT COUNT(*)
FROM appointments a
WHERE a.status = 3 AND
a.appointment_created_at >= CURDATE() - interval 1 day AND
a.appointment_created_at < CURDATE() AND
a.creator = 'xxx' AND
a.appointment_created_at = (SELECT MIN(a2.appointment_created_at)
FROM appointments a2
WHERE a2.creator = a.creator AND
a2.lead = a.lead
);
Notice that I changed the date comparisons so an index can be used for the WHERE clause. If you care about performance, you want indexes on:
appointments(creator, status, appointment_created_at, lead)
appointments(creator, lead, appointment_created_at).
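To check the "first-time appointment" logic on a handful of rows, the same rule can be written in Python. This is a sketch that mirrors the MIN() subquery above: it finds each lead's earliest appointment for the creator and counts it only if it falls on the given day with status 3. The tuple layout and the name count_first_time are assumptions.

```python
from datetime import date, datetime

def count_first_time(appointments, creator, day):
    """appointments: list of (creator, lead, created_at, status).
    Counts leads whose earliest appointment with this creator was created
    on the given day and has status 3."""
    first_seen = {}  # lead -> (created_at, status) of the earliest appointment
    for c, lead, created_at, status in appointments:
        if c != creator:
            continue
        if lead not in first_seen or created_at < first_seen[lead][0]:
            first_seen[lead] = (created_at, status)
    return sum(1 for created_at, status in first_seen.values()
               if created_at.date() == day and status == 3)

# Hypothetical data: lead L1 is being reset, lead L2 is genuinely new.
appointments = [
    ("xxx", "L1", datetime(2020, 1, 5, 10, 0), 3),
    ("xxx", "L1", datetime(2020, 1, 10, 9, 0), 3),
    ("xxx", "L2", datetime(2020, 1, 10, 11, 0), 3),
    ("yyy", "L3", datetime(2020, 1, 10, 12, 0), 3),
]
new_leads = count_first_time(appointments, "xxx", date(2020, 1, 10))
```

Here only L2 counts for 2020-01-10, because L1 already had an appointment with the same salesperson five days earlier.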
If the sales people can reschedule appointments then you are going to need an additional field to store original appointment date, at least. There are other more complex solutions, but this is probably the easiest approach.
I have 2 tables, one with hostels (effectively a single-room hotel with lots of beds), and the other with bookings.
Hostel table: unique ID, total_spaces
Bookings table: start_date, end_date, num_guests, hostel_ID
I need a (My)SQL query to generate a list of all hostels that have at least num_guests free spaces between start_date and end_date.
Logical breakdown of what I'm trying to achieve:
For each hostel:
Get all bookings that overlap start_date and end_date
For each day between start_date and end_date, sum the total bookings for that day (taking into account num_guests for each booking) and compare with total_spaces, ensuring that there are at least num_guests spaces free on that day (if there aren't on any day then that hostel can be discounted from the results list)
Any suggestions on a query that would do this please? (I can modify the tables if necessary)
I built an example for you here, with more comments, which you can test out:
http://sqlfiddle.com/#!9/10219/9
What's probably tricky for you is to join ranges of overlapping dates. The way I would approach this problem is with a DATES table. It's kind of like a tally table, but for dates. If you join to the DATES table, you basically break down all the booking ranges into bookings for individual dates, and then you can filter and sum them all back up to the particular date range you care about. Helpful code for populating a DATES table can be found here: Get a list of dates between two dates and that's what I used in my example.
Other than that, the query basically follows the logical steps you've already outlined.
OK, if you are using MySQL 8.0.2 or above, you can use window functions, and in that case the solution below applies. It does not need to compute the number of guests for each day in the query interval; it only looks at the days on which the number of hostel guests changes. Therefore, no helper table of dates is needed.
with query as
(
select * from bookings where end_date > '2017-01-02' and start_date < '2017-01-05'
)
select hostel.*, bookingsSum.intervalMax
from hostel
join
(
select tmax.id, max(tmax.intervalCount) intervalMax
from
(
select hostel.id, t.dat, sum(coalesce(sum(t.gn),0)) over (partition by t.id order by t.dat) intervalCount
from hostel
left join
(
select id, start_date dat, guest_num as gn from query
union all
select id, end_date dat, -1 * guest_num as gn from query
) t on hostel.id = t.id
group by hostel.id, t.dat
) tmax
group by tmax.id
) bookingsSum on hostel.id = bookingsSum.id and hostel.total_spaces >= bookingsSum.intervalMax + <num_of_people_you_want_accomodate>
demo
It uses a simple trick, where each start_date adds +guest_num to the running number of guests and each end_date adds -guest_num. We then do the necessary summations to find the peak number of guests (intervalMax) in the query interval.
If you change '2017-01-05' in my query to '2017-01-06', only two hostels are in the result, and if you use '2017-01-07', just hostel id 3 is in the result, since it does not have any bookings yet.
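The +guests / -guests trick may be easier to follow outside SQL, so here is a minimal Python sketch of it for one hostel. It assumes end_date is the checkout day (the bed is free again on that day) and that bookings are (start, end, guests) tuples; peak_guests and has_room are hypothetical names.

```python
from datetime import date

def peak_guests(bookings, query_start, query_end):
    """bookings: list of (start, end, guests), with end treated as the checkout day.
    Returns the peak number of guests among bookings overlapping the query interval."""
    events = []
    for start, end, guests in bookings:
        if end > query_start and start < query_end:  # same overlap filter as the CTE
            events.append((start, guests))   # arrival adds guests
            events.append((end, -guests))    # departure removes them
    events.sort()  # on the same day, departures (negative) sort before arrivals
    peak = current = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak

def has_room(total_spaces, bookings, query_start, query_end, party_size):
    """True if the hostel can take party_size more guests over the whole interval."""
    return total_spaces >= peak_guests(bookings, query_start, query_end) + party_size

# Hypothetical bookings for one hostel with 10 beds.
bookings = [
    (date(2017, 1, 1), date(2017, 1, 3), 4),
    (date(2017, 1, 2), date(2017, 1, 5), 3),
    (date(2017, 1, 3), date(2017, 1, 4), 2),
]
peak = peak_guests(bookings, date(2017, 1, 2), date(2017, 1, 5))
```

Only the days on which occupancy changes are visited, so the cost depends on the number of bookings rather than the length of the date range.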