!!! ---------------------!!!
After exchanging some comments with people, I have decided to include the source data, so anyone who wishes to help with solving this could load it into a table and run the query.
Here is the link that retrieves data from Yahoo Finance for symbol SPY in csv format.
https://query1.finance.yahoo.com/v7/finance/download/SPY?period1=1584963550&period2=1616499550&interval=1d&events=history&includeAdjustedClose=true
The file header needs to be changed. Change Date to Source_Date and Adj Close to Adj_Close. The file does not have a Symbol column, so the query doesn't need to reference it. The only two relevant columns are Source_Date and Adj_Close.
!!! ---------------------!!!
The issue with my query is that it takes a very long time to run. The query is not wrong. I know exactly why it takes such a long time to run. I just couldn't come up with anything more efficient.
First, here is the business logic.
Let's say I bought Apple stock and it went down. It's been 10 days since I bought it and it is still down. I extracted the entire history of daily prices for Apple going back to 1993 and loaded it into a database table. Now I want to write a query that tells me how often the Apple stock price took more than 10 days to recover.
For example: I bought Apple at $100. It went down to $90. It's been 10 days since I bought it. I run my query and it comes back with something like this:
-- Buy Date: April 1, 2001. Buy price: $10. Recovered on: April 12, 2001. Days to recovery: 12
-- Buy Date: June 12, 2006. Buy price: $23. Recovered on: July 20, 2006. Days to recovery: 38
-- Buy Date: January 15, 2009. Buy price: $65. Recovered on: December 30, 2010. Days to recovery: 700
Each example indicates that on each day between Buy Date and Recovered on date, the price stayed below the buy price.
My query has two steps.
The first step joins the table to itself, aliased x and y. For each record in x, it searches for the minimum y date where y's price is higher than x's price.
The second step simply extracts from the first step's result set only those records where the difference between x.date and y.date is greater than 10.
The reason the first query (see below) runs for such a long time is that for each record in x, the query must search the entire table as y. That's a lot of table scans. It comes back with the correct result, but it takes between 50 and 60 seconds.
The table structure is very simple: Symbol, Date, and Price. Symbol and Date form the primary key.
```sql
SELECT x.Symbol,
       x.Source_Date      AS Source_Date,
       MIN(y.Source_Date) AS Recovery_Date
FROM transformed_source x,
     transformed_source y
WHERE x.Symbol = 'AAPL'
  AND y.Symbol = x.Symbol
  AND y.Source_Date > x.Source_Date
  AND y.Adj_Close > x.Adj_Close
GROUP BY x.Symbol, x.Source_Date;
```
*** Note: this query misses records where the price never recovered, so I will need to modify it with an outer join. That wouldn't make any difference here; changing it to an outer join will not make it run any faster. So, working with this.
Any ideas are welcome.
Thank you
You might find that a correlated subquery is better:
```sql
SELECT ts.Symbol, ts.Source_Date,
       (SELECT MIN(ts2.Source_Date)
        FROM transformed_source ts2
        WHERE ts2.Symbol = ts.Symbol
          AND ts2.Source_Date > ts.Source_Date
          AND ts2.Adj_Close > ts.Adj_Close
       ) AS Recovery_Date
FROM transformed_source ts
WHERE ts.Symbol = 'AAPL';
```
Then, for performance, you want an index on transformed_source(Symbol, Source_Date, Adj_Close).
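A minimal sketch of creating that index (MySQL-style syntax; the index name is illustrative):

```sql
CREATE INDEX ix_symbol_date_close
    ON transformed_source (Symbol, Source_Date, Adj_Close);
```

With this index, the correlated subquery becomes an index range scan per outer row instead of a full table scan.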
If you're using an RDBMS that provides row pattern matching (e.g. Oracle with MATCH_RECOGNIZE), then you can write a query that does a single pass over your table.
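For reference, an Oracle 12c+ MATCH_RECOGNIZE version might look roughly like this (a sketch against the question's transformed_source table, not tested):

```sql
SELECT symbol, buy_date, buy_price, recovery_date,
       recovery_date - buy_date AS days_to_recovery
FROM transformed_source
MATCH_RECOGNIZE (
  PARTITION BY symbol
  ORDER BY source_date
  MEASURES
    buy.source_date     AS buy_date,
    buy.adj_close       AS buy_price,
    recover.source_date AS recovery_date
  ONE ROW PER MATCH
  AFTER MATCH SKIP TO NEXT ROW        -- treat every row as a potential buy date
  PATTERN (buy below* recover)
  DEFINE
    below   AS below.adj_close   <= buy.adj_close,
    recover AS recover.adj_close >  buy.adj_close
);
```

Each match anchors at a buy row, consumes the run of days at or below the buy price, and ends at the first higher close.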
I've put together a DBFiddle
Related
I am trying to pull a report from a dataset; the conditions are as follows:
Customers A, B, and C produced 100, 150, and 200 tickets respectively in a year.
A's period from 1/1/2022 till 3/30/2022
B's period from 1/10/2022 till 6/20/2022
C's period from 6/10/2022 till 9/5/2022
I want to pull how many cases each customer produced while they were in the incubation period, such that the report will not include any cases outside the customer's incubation period.
The start date and end date are available in a table.
Hopefully I was able to explain this; thanks for your help.
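A sketch of the kind of query that could produce this report, assuming hypothetical tables cases(customer_id, case_date) and customer_periods(customer_id, start_date, end_date):

```sql
-- Count only the cases that fall inside each customer's incubation period.
SELECT p.customer_id,
       COUNT(c.case_date) AS cases_in_period
FROM customer_periods p
LEFT JOIN cases c
       ON c.customer_id = p.customer_id
      AND c.case_date BETWEEN p.start_date AND p.end_date
GROUP BY p.customer_id;
```

The LEFT JOIN keeps customers with zero in-period cases in the result, showing a count of 0.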
I would like to discuss the "best" way to store date periods in a database. Let's talk about SQL/MySQL, but this question may apply to any database. I have the feeling I have been doing something wrong for years...
In English, the information I have is:
-In year 2014, value is 1000
-In year 2015, value is 2000
-In year 2016, there is no value
-In year 2017 (and go on), value is 3000
Someone may store as:
BeginDate EndDate Value
2014-01-01 2014-12-31 1000
2015-01-01 2015-12-31 2000
2017-01-01 NULL 3000
Others may store as:
Date Value
2014-01-01 1000
2015-01-01 2000
2016-01-01 NULL
2017-01-01 3000
With the first method, the validation rules needed to avoid holes and overlaps look like mayhem to develop.
With the second method, the problem seems to be filtering a single date into the period that contains it.
Which would my colleagues prefer? Any other suggestions?
EDIT: I used full years only for the example; my data usually changes with day granularity.
EDIT 2: I thought about storing the "Date" as the "BeginDate", ordering rows by Date, and then selecting the "EndDate" from the next (or previous) row. Storing "BeginDate" and "Interval" would lead to the same hole/overlap problem as method one, which I would need a complex validation rule to avoid.
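The idea in EDIT 2 (deriving each row's end date from the next row's begin date) can be sketched with a window function in MySQL 8+; the table name here is an assumption:

```sql
SELECT `Date` AS BeginDate,
       -- next row's begin date minus one day; NULL for the open-ended last period
       LEAD(`Date`) OVER (ORDER BY `Date`) - INTERVAL 1 DAY AS EndDate,
       Value
FROM period_table
ORDER BY `Date`;
```

Because the end date is derived rather than stored, holes and overlaps are impossible by construction.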
It mostly depends on the way you will be using this information - I'm assuming you do more than just store values for a year in your database.
Lots of guesses here, but I guess you have other tables with time-bounded data, and that you need to compare the dates to find matches.
For instance, in your current schema:
```sql
SELECT *
FROM other_table ot
INNER JOIN year_table yt
        ON ot.transaction_date BETWEEN yt.year_start AND yt.year_end;
```
That should be an easy query to optimize - it's a straight data comparison, and if the table is big enough, you can add indexes to speed it up.
In your second schema suggestion, it's not as easy:
```sql
SELECT *
FROM other_table ot
INNER JOIN year_table yt
        ON ot.transaction_date BETWEEN yt.year_start
                                   AND yt.year_start + INTERVAL 1 YEAR;
```
Crucially - this is harder to optimize, as every comparison needs to execute a scalar function. It might not matter - but with a large table, or a more complex query, it could be a bottleneck.
You can also store the year as an integer (as some of the commenters recommend).
```sql
SELECT *
FROM other_table ot
INNER JOIN year_table yt
        ON YEAR(ot.transaction_date) = yt.year;
```
Again - this is likely to have a performance impact, as every comparison requires a function to execute.
The purist in me doesn't like to store this as an integer - so you could also use MySQL's YEAR datatype.
So, assuming data size isn't an issue you're optimizing for, the solution really would lie in the way your data in this table relates to the rest of your schema.
I have an RDB with a quantity x, the date that quantity started being tracked (date_1), and the date it finished being tracked (date_2). If tracking is still ongoing, that second date is NULL, obviously.
What I would like to do is take the number x and average it over the interval between date_1 and date_2; if date_2 is NULL, then go by the current time. Any help?
[EDIT] To clarify, in RDB format: one row with a data column (x), a data column (date_1), and a data column (date_2), along with other fields of importance.
[EDIT] So imagine x as some integer like 100,000, and the dates being March 30, 2016 12:29:45 and April 3, 2016 03:42:29. I am not sure how to break down the date/times yet, so I am open to suggestions. The end goal is to calculate how much of x can be allocated to one month vs. how much to the other. Depending on how fine-grained you break down the time frame (days vs. seconds), those numbers will ultimately change.
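A sketch of the per-day average, assuming a hypothetical table tracking(x, date_1, date_2) and MySQL date functions:

```sql
SELECT x,
       date_1,
       COALESCE(date_2, NOW()) AS effective_end,
       -- x divided by the tracked duration, measured in fractional days;
       -- NULLIF guards against a zero-second interval
       x * 86400.0
         / NULLIF(TIMESTAMPDIFF(SECOND, date_1, COALESCE(date_2, NOW())), 0)
         AS avg_per_day
FROM tracking;
```

Measuring the duration in seconds and converting to days keeps the month-allocation math as fine-grained as possible.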
Right now I am developing a hotel reservation system, so I need to store prices for certain dates/date ranges on future days; the price varies on different days/dates, and I need to store those price and date details in the DB. I thought of two structures.
1st model:
```
room_prices:
    room_id
    from_date
    to_date
    price
    is_available
```
2nd design:
```
room_prices:
    room_id
    date
    price
    is_available
```
I found the 2nd method easier, but the data stored grows quickly as my hotel list grows.
Say I want to store the next (future) 2 months of price data: I need to create 60 records for every hotel.
With the 1st design, I don't require that many records.
ex:
```
price values for X hotel :
1-Dec-15 - 10-Dec-15 -- $200,
1st design requires : only 1 row
2nd design requires : 10 rows
```
I am using MySQL. Will there be any performance degradation while searching the room_prices table? Would someone suggest a better design?
I have actually worked on designing and implementing a Hotel Reservation system and can offer the following advice based on my experience.
I would recommend your second design option, storing one record for each individual date/hotel combination. Although there will be periods where a hotel's rate is the same across multiple days, it is more likely that, depending on availability, it will change over time and become different (hotels tend to increase the room rate as availability drops).
Also there are other important pieces of information that will need to be stored that are specific to a given day:
- You will need to manage the hotel availability, i.e. on date x there are y rooms available. This will almost certainly vary by day.
- Some hotels have blackout periods where the hotel is unavailable for short periods of time (typically specific days).
- Lead time: some hotels only allow rooms to be booked a certain number of days in advance, and this can differ between weekdays and weekends.
- Minimum nights: again, data stored by individual date that says if you arrive on this day you must stay x number of nights (say, over a weekend).
Also consider a person booking a week long stay, the database query to return the rates and availability for each day of that stay is a lot more concise if you have a pricing record for each Date. You can simply do a query where the Room Rate Date is BETWEEN the Arrival and Departure Date to return a dataset with one record per date of the stay.
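Under the one-record-per-date design, that stay lookup might look like this (a sketch; the table and column names are assumptions):

```sql
-- Rates and availability for a stay arriving 1 Dec and departing 8 Dec;
-- the departure night itself is not charged, hence the range ends on the 7th.
SELECT rate_date, price, is_available
FROM room_prices
WHERE room_id = 42
  AND rate_date BETWEEN '2015-12-01' AND '2015-12-07'
ORDER BY rate_date;
```

One row comes back per night of the stay, so pricing and availability checks are a single range scan.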
I realise with this approach you will store more records but with well indexed tables the performance will be fine and the management of the data will be much simpler. Judging by your comment you are only talking in the region of 18000 records which is a pretty small volume (the system I worked on has several million and works fine).
To illustrate the extra data management if you DON'T store one record per day, imagine that a Hotel has a rate of 100 USD and 20 rooms available for the whole of December:
You will start with one record:
1-Dec to 31st Dec Rate 100 Availability 20
Then you sell one room on the 10th Dec.
Your business logic now has to create three records from the one above:
1-Dec to 9th Dec Rate 100 Availability 20
10-Dec to 10th Dec Rate 100 Availability 19
11-Dec to 31st Dec Rate 100 Availability 20
Then the rate changes on the 3rd and 25th Dec to 110
Your business logic now has to split the data again:
1-Dec to 2-Dec Rate 100 Availability 20
3-Dec to 3-Dec Rate 110 Availability 20
4-Dec to 9-Dec Rate 100 Availability 20
10-Dec to 10-Dec Rate 100 Availability 19
11-Dec to 24-Dec Rate 100 Availability 20
25-Dec to 25-Dec Rate 110 Availability 20
26-Dec to 31-Dec Rate 100 Availability 20
That is more business logic and more overhead than storing one record per date.
I can guarantee you that by the time you have finished your system will end up with one row per date anyway so you might as well design it that way from the beginning and get the benefits of easier data management and quicker database queries.
I think that the first solution is better; as you already noticed, it reduces the storage you need for prices. Another possible approach: store a single date and assume that the price specified is valid until a new date is found. Basically the same structure you designed in the second approach, but with internal logic that overrides a price when a row with a newer date exists for the specified period.
Let's say that you have ROOM 1 with price $200 from 1st December and price $250 from 12th December; then you will have only two rows:
1-Dec-15 -- $200
12-Dec-15 -- $250
And you will assume in your logic that a price is valid from the specified date up until a new price is found.
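Looking up the price in effect on a given date under that scheme might look like this (a sketch using the second design's columns):

```sql
-- Price in effect on 2015-12-05: the most recent row on or before that date.
SELECT price
FROM room_prices
WHERE room_id = 1
  AND `date` <= '2015-12-05'
ORDER BY `date` DESC
LIMIT 1;
```

For the two rows above, this returns $200 for any date from 1 to 11 December and $250 from 12 December onward.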
Does an IF condition in the where clause of a MySQL query slow down the execution drastically?
Here is one sample query:
```sql
SELECT *
FROM alert_details_v adv
WHERE (IF(DAY(LAST_DAY(NOW())) < DAY(adv.alert_date),
          DAY(LAST_DAY(NOW())),
          DAY(adv.alert_date))
       - adv.alert_trigger_days) <= DAY(NOW());
```
Sample data:
alert_id alert_date alert_trigger_days
==================================================
1 2013-09-14 00:00:00 6
2 2013-09-13 00:00:00 5
alert_date: Some user input date
alert_trigger_days: Number of days before the actual date the alert be triggered.
Brief about query logic:-
Here I am trying to find whether the last day of the current month is earlier than the day of alert_date (a database column). Whichever day comes first is the one considered.
Basically this table is meant for storing alert information. So if the user has chosen 30th of some month and the alert is recurring monthly then for February it would not find the day 30th and hence would not show the record.
My question is: does a query with IF conditions in the WHERE clause (as in the sample query above) slow down execution drastically, or only slightly, when there are hundreds of thousands of records in the table?
This may depend entirely upon your table and data. Sometimes it may help performance, and sometimes it may degrade it.
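As a side note, the IF in the sample query is just taking the minimum of two day numbers, so it can equivalently be written with LEAST, which is arguably clearer. It is still evaluated per row, though, so neither form can use an index on alert_date:

```sql
SELECT *
FROM alert_details_v adv
WHERE LEAST(DAY(LAST_DAY(NOW())), DAY(adv.alert_date))
      - adv.alert_trigger_days <= DAY(NOW());
```

The real cost driver is not IF versus LEAST but the fact that the predicate wraps the column in expressions, forcing a full scan of the view's rows.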