MySQL: Optimizing query for calculation of financial BETA

In finance, a stock's beta is the covariance between the stock's daily returns and an index's daily returns, divided by the variance of the index's daily returns. I am trying to calculate beta for a set of stocks and a set of indices.
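In symbols, writing r_s for the stock's daily log return and r_i for the index's, and using the identity Cov(X,Y) = E[XY] - E[X]E[Y]:

\beta = \frac{\operatorname{Cov}(r_s, r_i)}{\operatorname{Var}(r_i)} = \frac{E[r_s r_i] - E[r_s]\,E[r_i]}{\operatorname{Var}(r_i)}

This is exactly the AVG(x*y) - AVG(x)*AVG(y) over VAR_POP form used in the query below.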
Here's my query for a 50-business-day rolling window, which I'd like to optimize for speed:
INSERT INTO betas (permno, index_id, DATE, beta)
(SELECT
    permno, index_id, s.date,
    IF(s.`seq` >= 50,
        (SELECT
            (AVG(s2.log_return * i2.log_return) - AVG(s2.log_return) * AVG(i2.log_return))
                / VAR_POP(i2.log_return) AS beta
         FROM stock_series s2
         INNER JOIN `index_series` i2 ON i2.date = s2.date
         WHERE i2.index_id = i.index_id
           AND s2.permno = s.permno
           AND s2.`seq` BETWEEN s.`seq` - 49 AND s.`seq`
         GROUP BY index_id, permno),
        NULL) AS beta
 FROM stock_series s
 INNER JOIN `index_series` i ON i.index_id IN ('SP500') AND i.date = s.date
)
ON DUPLICATE KEY
UPDATE beta = VALUES(beta)
Both main tables are already ordered by entity and date in ascending order, and they already include log daily returns as well as a `seq` column. Seq sequentially enumerates all daily rows company- (or index-) wise, i.e. seq restarts at 1 for every new stock or index in the table and counts up to the total number of rows for that entity. I created it to allow for the rolling window.
As of now, with 500 firms and 1 index, the query takes like forever to complete.
Let me know about any optimization that comes to your mind - views, stored procs, temp tables - and of course any inconsistencies you find.
EDIT: Indexes:
stock_series has PRIMARY KEY (permno,date) and UNIQUE KEY (permno,seq),
index_series has PRIMARY KEY (index_id,date)
EXPLAIN EXTENDED results for ONE company (by including a WHERE s.permno=... restriction at the end) and for ALL ~500 companies: [output not included]

Here is what the pros do: do NOT calculate that in the database. Pull the data out, calculate, and reinsert. Where I work now they have a huge grid doing that stuff in the end-of-day run. Yes, grid - as in a significant number of machines. We are talking about producing gigabytes of CSV files that then get reloaded into the database. Beta, Gamma, PnL on trades with 120,000 different elements. Databases are NOT optimized for this.
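That said, if the calculation does stay in MySQL: on MySQL 8.0+ the rolling window can be expressed with window functions instead of re-running a correlated subquery over the 50-row slice for every output row. A rough sketch under the question's schema (untested; the partial-window guard from the original IF(seq >= 50, ...) still has to be applied in an outer query before inserting into betas):

SELECT
    s.permno,
    i.index_id,
    s.date,
    -- Same covariance shortcut as the original query, computed over a
    -- sliding 50-row frame instead of a correlated subquery:
    (AVG(s.log_return * i.log_return) OVER w
     - AVG(s.log_return) OVER w * AVG(i.log_return) OVER w)
        / VAR_POP(i.log_return) OVER w AS beta
FROM stock_series s
INNER JOIN index_series i ON i.index_id = 'SP500' AND i.date = s.date
WINDOW w AS (
    PARTITION BY s.permno, i.index_id
    ORDER BY s.seq
    ROWS BETWEEN 49 PRECEDING AND CURRENT ROW
);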


SQL Capture duplicate records across two DIFFERENT columns

I am writing an exception-catching page using MySQL for catching duplicate billing entries in the following scenario.
Items details are entered in a table which has the following two columns (among others).
ItemCode VARCHAR(50), BillEntryDate DATE
It often happens that the same item's bill is entered multiple times, but over a period of a few days. Like,
"Football","2019-01-02"
"Basketball","2019-01-02"
...
...
"Football","2019-01-05"
"Rugby","2019-01-05"
...
"Handball","2019-01-05"
"Rugby","2019-01-07"
"Rugby","2019-01-10"
In the above example, the item Football is billed twice - first on 2 Jan and again on 5 Jan. Similarly, the item Rugby is billed thrice, on 5, 7 and 10 Jan.
I am looking to write simple SQL which can pick up each item [say, using a DISTINCT(ItemCode) clause] and then display all the records which are duplicates over a period of 30 days.
In the above case, the expected output should be the following 5 records:
"Football","2019-01-02"
"Football","2019-01-05"
"Rugby","2019-01-05"
"Rugby","2019-01-07"
"Rugby","2019-01-10"
I am trying to run the following SQL:
select * from tablen a, tablen b where a.ItemCode=b.ItemCode and a.BillEntryDate = b.BillEntryDate+30;
However, this seems to be highly inefficient, as it runs for a long time without displaying any records.
Is there a less complex and faster method?
I did explore existing topics (like How do I find duplicates across multiple columns?), but they catch duplicates where BOTH columns have the same value. My requirement is one column with the same value, and the second column varying over a month-long date range.
You can use:
select t.*
from tablen t
where exists (
    select 1
    from tablen t2
    where t2.ItemCode = t.ItemCode and
          t2.BillEntryDate <> t.BillEntryDate and
          t2.BillEntryDate >= t.BillEntryDate - interval 30 day and
          t2.BillEntryDate <= t.BillEntryDate + interval 30 day
);
This will pick up both duplicates in the pair.
For performance, you want an index on (ItemCode, BillEntryDate).
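For example (the index name is arbitrary; tablen is the table name from the question's sample query):

CREATE INDEX idx_item_date ON tablen (ItemCode, BillEntryDate);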
With EXISTS:
select ItemCode, BillEntryDate
from tablename t
where exists (
    select 1
    from tablename
    where ItemCode = t.ItemCode
      and abs(datediff(BillEntryDate, t.BillEntryDate)) between 1 and 30
)

SQL - Add To Existing Average

I'm trying to build a reporting table to track server traffic and popularity overall. Each SID is a unique game server hosting a particular game, and each UCID is a unique player key connecting to that server.
Say I have a table like so:
SID  UCID          AvgTime  NumConnects
---------------------------------------
1    AIE9348ietjg   300.55            5
1    Po328gieijge   500.66            7
2    AIE9348ietjg   234.55            3
3    Po328gieijge  1049.88           18
We can see that there are 2 unique players and 3 unique servers, with SID 1 having 2 players that have connected to it at some point in the past. AvgTime is the average amount of time those players spent on that server (in seconds), and NumConnects is the size of the average (i.e. 300.55 is an average over 5 elements).
Now I run a job in the background where I process a raw connection table and pull out player connections like so:
SID  UCID          ConnectTime  DisconnectTime
----------------------------------------------
1    AIE9348ietjg        90.35          458.32
2    Po328gieijge        30.12           87.15
2    AIE9348ietjg       173.12          345.35
This table has no ID or other fluff to help condense my example. There may be multiple connect/disconnect records for multiple players in this table. What I want to do is add to my existing AvgTime for each SID these new values.
There is a formula I am trying to use (taken from this Math StackExchange answer: https://math.stackexchange.com/questions/1153794/adding-to-an-average-without-unknown-total-sum/1153800#1153800):
Average = (Average * Size + NewValue) / (Size + 1)
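As a quick check against the sample data above (SID 1, player AIE9348ietjg, whose new session lasts 458.32 - 90.35 = 367.97 seconds):
Average = (300.55 * 5 + 367.97) / (5 + 1) = 1870.72 / 6 ≈ 311.79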
How can I write an update query to update each ServerID's row in the traffic table above, adding to the average using the above formula for each pair of records? I tried something like the following, but it didn't work (it returned NULL):
UPDATE server_traffic st
LEFT JOIN connect_log l
ON st.SID = l.SID AND st.UCID = l.UCID
SET AvgTime = (AvgTime * NumConnects + SUM(l.DisconnectTime - l.ConnectTime) / NumConnects + COUNT(l.UCID)
I would prefer an answer in MySql, but I'll accept MS SQL as well.
EDIT
I understand that statistics and calculations are generally not to be stored in tables and that you can run reports that would crunch the numbers for you. My requirement is that users can go to a website and view the popularity of various servers. This needs to be done in a way that
A: running a complex query per user doesn't crash or slow down the system
B: the page returns the data within a few seconds at most
See this example here: https://bf4stats.com/pc/shinku555555
This is a web page for Battlefield 4 stats - notice that the load is almost instant for this player, and I get back a load of statistics without waiting for some complex report query to return the data. I'm assuming they store these calculations in preprocessed tables, where the webpage just needs to do a simple select to return the values. That's the same approach I want to take with my database and web application design.
Sorry if this is off topic to the original question - but hopefully this adds additional context that helps people understand my needs.
Since you cannot run aggregate functions like SUM and COUNT at the individual row level in SQL, but only inside an aggregate query, consider joining to an aggregate subquery in the UPDATE...LEFT JOIN. Also, adjust the parentheses in SET to match the formula above.
Also note that since you use LEFT JOIN, rows with non-matching IDs will render NULL for the aggregate fields, and NULL propagates through arithmetic operations, making the whole expression NULL. You can convert NULLs to zero with IFNULL(), but the formula's division may still fail.
UPDATE server_traffic s
LEFT JOIN (
    SELECT SID, UCID, COUNT(UCID) AS GrpCount,
           SUM(DisconnectTime - ConnectTime) AS SumTimeDiff
    FROM connect_log
    GROUP BY SID, UCID
) l ON s.SID = l.SID AND s.UCID = l.UCID
SET s.AvgTime = (s.AvgTime * s.NumConnects + l.SumTimeDiff) / (s.NumConnects + l.GrpCount)
Aside - reconsider saving calculations/statistics within tables, as they can always be computed by queries, even filtered by timestamps. Ideally, database tables should store raw values.
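For instance, an on-the-fly report over the raw log alone (assuming the connect_log layout shown in the question, with one row per session) could replace the stored averages entirely:

SELECT SID, UCID,
       AVG(DisconnectTime - ConnectTime) AS AvgTime,
       COUNT(*) AS NumConnects
FROM connect_log
GROUP BY SID, UCID;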

Relational Database Logic

I'm fairly new to php / mysql programming and I'm having a hard time figuring out the logic for a relational database that I'm trying to build. Here's the problem:
I have different leaders who will be in charge of a store anytime between 9am and 9pm.
A customer who has visited the store can rate their experience on a scale of 1 to 5.
I'm building a site that will allow me to store the shifts that a leader worked as seen below.
When I hit submit, the site would take the data leaderName:"George", shiftTimeArray: 11am, 1pm, 6pm (from the example in the picture) and the shiftDate and send them to an SQL database.
Later, I want to be able to get the average score for a person by sending a query to MySQL, retrieving all of the scores that that leader received, and averaging them together. I know the code to build the forms and to perform the search. However, I'm having a hard time coming up with the logic for the tables that will relate the data. Currently, I have a MySQL table called responses that contains the following fields:
leader_id
shift_date // contains the date that the leader worked
shift_time // contains the time that the leader worked
visit_date // contains the date that the survey/score was given
visit_time // contains the time that the survey/score was given
score // contains the actual score of the survey (1-5)
I enter the shifts that the leader works at the beginning of the week and then enter the survey scores in as they come in during the week.
So Here's the Question: What mysql tables and fields should I create to relate this data so that I can query a leader's name and get the average score from all of their surveys?
You want tables like:
Leader (leader_id, name, etc)
Shift (leader_id, shift_date, shift_time)
SurveyResult (visit_date, visit_time, score)
Note: omitted the surrogate primary keys for Shift and SurveyResult that I would probably include.
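A minimal DDL sketch of that layout (the column types here are assumptions, and the surrogate keys are again omitted):

CREATE TABLE Leader (
    leader_id INT UNSIGNED NOT NULL PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE Shift (
    leader_id INT UNSIGNED NOT NULL,
    shift_date DATE NOT NULL,
    shift_time TIME NOT NULL
);

CREATE TABLE SurveyResult (
    visit_date DATE NOT NULL,
    visit_time TIME NOT NULL,
    score TINYINT NOT NULL
);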
To query, you join shifts to surveys, group on leader taking the average, then join that back to Leader for the name.
The query might be something like this (but I haven't actually built it in MySQL to verify the syntax):
SELECT name,
       AverageScore
FROM Leader a
INNER JOIN (
    SELECT leader_id,
           AVG(score) AS AverageScore
    FROM Shift
    INNER JOIN SurveyResult ON shift_date = visit_date
        AND shift_time = visit_time -- what this needs to be depends on how you record time
    GROUP BY leader_id
) b ON a.leader_id = b.leader_id
I would do the following structure:
leaders
    id
    name
leaders_timetable (can be multiple per leader)
    id
    leader_id
    shift_datetime (I assume it stores date and hour here; minutes and seconds are always 0)
survey_scores
    id
    visit_datetime
    score
SELECT l.id, l.name, AVG(s.score)
FROM leaders l
INNER JOIN leaders_timetable lt ON lt.leader_id = l.id
INNER JOIN survey_scores s ON lt.shift_datetime = DATE_FORMAT(s.visit_datetime, '%Y-%m-%d %H:00:00')
GROUP BY l.id
DATE_FORMAT here cuts the minutes and seconds off visit_datetime (truncating it to the hour) so that it can be matched against shift_datetime. This is a MySQL function, so if you use a different database you'll need a different function.
Say you have a leader who has 5 survey rows with scores 1, 2, 3, 4 and 5.
If you select all surveys for this leader, sum the survey scores, and divide them by 5 (the total number of surveys that this leader has), you will have the average, in this case 3:
(1 + 2 + 3 + 4 + 5) / 5 = 3
You wouldn't need to create any more tables or fields, you have what you need.

MySQL group by query optimization

I have a query like this in MySQL:
select count(*)
from (
    select count(idCustomer)
    from customer_details
    where ...
    group by idCustomer
    having max(trx_staus) = -1
) as temp
So basically I am finding the count of customers that fulfill a certain WHERE condition (one or two) and whose max transaction status is -1 (others can be 2, 3, 4). But this query takes about 30 minutes on my local machine and 13 seconds on a high-configuration server (about 20 GB RAM and an 8-core processor). I have 1.3 million (13 lac) rows in the table. I know GROUP BY and HAVING MAX are costly. What can I do to optimize this query? Any suggestions?
The inner query has to inspect all rows to determine the aggregate maximum; if you want to optimize this, add a calculated field that contains the maximum to your customer table and select on that.
The trick is then to keep that field up-to-date :)
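One way that could look (a sketch only: the customers table name, the new column, and the trigger are assumptions on top of the question's customer_details/trx_staus names):

-- Add and backfill the calculated field
ALTER TABLE customers ADD COLUMN max_trx_staus INT NULL;

UPDATE customers c
JOIN (
    SELECT idCustomer, MAX(trx_staus) AS mx
    FROM customer_details
    GROUP BY idCustomer
) d ON d.idCustomer = c.idCustomer
SET c.max_trx_staus = d.mx;

-- Keep it up-to-date as new detail rows arrive
CREATE TRIGGER customer_details_ai AFTER INSERT ON customer_details
FOR EACH ROW
    UPDATE customers
    SET max_trx_staus = GREATEST(COALESCE(max_trx_staus, NEW.trx_staus), NEW.trx_staus)
    WHERE idCustomer = NEW.idCustomer;

-- The expensive GROUP BY ... HAVING then collapses to a simple filter
-- (the ... stands for the question's unstated WHERE condition):
SELECT COUNT(*) FROM customers WHERE ... AND max_trx_staus = -1;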

MySQL query puzzle - finding what WOULD have been the most recent date

I've looked all over and haven't yet found an intelligent way to handle this, though I feel sure one is possible:
One table of historical data has quarterly information:
CREATE TABLE Quarterly (
    unique_ID INT UNSIGNED NOT NULL,
    date_posted DATE NOT NULL,
    datasource TINYINT UNSIGNED NOT NULL,
    data FLOAT NOT NULL,
    PRIMARY KEY (unique_ID)
);
Another table of historical data (which is very large) contains daily information:
CREATE TABLE Daily (
    unique_ID INT UNSIGNED NOT NULL,
    date_posted DATE NOT NULL,
    datasource TINYINT UNSIGNED NOT NULL,
    data FLOAT NOT NULL,
    qtr_ID INT UNSIGNED,
    PRIMARY KEY (unique_ID)
);
The qtr_ID field is not part of the feed of daily data that populated the database - instead, I need to retroactively populate the qtr_ID field in the Daily table with the Quarterly.unique_ID row ID, using what would have been the most recent quarterly data on that Daily.date_posted for that data source.
For example, if the quarterly data is
101  2009-03-31  1  4.5
102  2009-06-30  1  4.4
103  2009-03-31  2  7.6
104  2009-06-30  2  7.7
105  2009-09-30  1  4.7
and the daily data is
1001  2009-07-14  1  3.5  ??
1002  2009-07-15  1  3.4  &&
1003  2009-07-14  2  2.3  ^^
then we would want the ?? qtr_ID field to be assigned '102' as the most recent quarter for that data source on that date, and && would also be '102', and ^^ would be '104'.
The challenges include that both tables (particularly the daily table) are actually very large, they can't be normalized to get rid of the repetitive dates or otherwise optimized, and for certain daily entries there is no preceding quarterly entry.
I have tried a variety of joins, using datediff (where the challenge is finding the minimum value of datediff greater than zero), and other attempts but nothing is working for me - usually my syntax is breaking somewhere. Any ideas welcome - I'll execute any basic ideas or concepts and report back.
Just subquery for the quarter ID using something like:
(
    SELECT unique_ID
    FROM Quarterly
    WHERE
        datasource = ?
        AND date_posted <= ?
    ORDER BY
        unique_ID DESC
    LIMIT 1
)
Of course, this probably won't give you the best performance, and it assumes that dates are added to Quarterly sequentially (otherwise order by date_posted). However, it should solve your problem.
You would use this subquery on your INSERT or UPDATE statements as the value of your qtr_ID field for your Daily table.
The following appears to work exactly as intended, but it surely is ugly (with three calls to the same DATEDIFF!). Perhaps by seeing a working query someone will be able to further reduce or improve it:
UPDATE Daily SET qtr_ID = (
    SELECT unique_ID FROM Quarterly
    WHERE Quarterly.datasource = Daily.datasource AND
          DATEDIFF(Daily.date_posted, Quarterly.date_posted) = (
              SELECT MIN(DATEDIFF(Daily.date_posted, Quarterly.date_posted))
              FROM Quarterly
              WHERE Quarterly.datasource = Daily.datasource AND
                    DATEDIFF(Daily.date_posted, Quarterly.date_posted) > 0
          )
);
After more work on this query, I ended up with enormous performance improvements over the original concept. The most important improvement was to create indices in both the Daily and Quarterly tables - in Daily I created indices on (datasource, date_posted) and (date_posted, datasource) USING BTREE and on (datasource) USING HASH, and in Quarterly I did the same thing. This is overkill but it made sure I had an option that the query engine could use. That reduced the query time to less than 1% of what it had been. (!!)
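In DDL terms the added indexes might look like the following (index names are illustrative; note that InnoDB accepts USING HASH but silently builds a BTREE anyway):

ALTER TABLE Daily
    ADD INDEX idx_ds_date (datasource, date_posted) USING BTREE,
    ADD INDEX idx_date_ds (date_posted, datasource) USING BTREE,
    ADD INDEX idx_ds (datasource) USING HASH;

ALTER TABLE Quarterly
    ADD INDEX idx_ds_date (datasource, date_posted) USING BTREE,
    ADD INDEX idx_date_ds (date_posted, datasource) USING BTREE,
    ADD INDEX idx_ds (datasource) USING HASH;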
Then, I learned that given my particular circumstances I could use MAX() instead of ORDER BY and LIMIT so I use a call to MAX() to get the appropriate unique_ID. That reduced the query time by about 20%.
Finally, I learned that with the InnoDB storage engine I could segment the chunk of the Daily table that I was updating with any one query, which allowed me to multi-thread the queries with a little elbow-grease and scripting. The parallel processing worked well and every thread reduced the query time linearly.
So, the basic query that is performing literally 1000 times better than my own first attempt is:
UPDATE Daily
SET qtr_ID = (
    SELECT MAX(unique_ID)
    FROM Quarterly
    WHERE Daily.datasource = Quarterly.datasource AND
          Daily.date_posted > Quarterly.date_posted
)
WHERE unique_ID > ScriptVarLowerBound AND
      unique_ID <= ScriptVarHigherBound;