I want to store daily fund data for approximately 2000 funds over 20 years or more. At first I figured I would just create one giant table with one column per fund and one row per date. I ran into trouble trying to create this table and also realise that a table like that would have a lot of NULL values (almost half the values would be NULL).
Is there a more efficient way of structuring the table or database for quickly finding and fetching the data for a specific fund over hundreds (or thousands) of days?
The alternative way I've thought of doing this is with three columns (date, fund_id, fund_value). This however does not seem optimal to me since both the date and fund_id would be duplicated many times over. Having a few million data points just for the date (instead of a few thousand) seems wasteful.
Which is the better option? Or is there a better way to accomplish this?
Having the three columns you mention is fine. fund_value is the price of fund_id on fund_date. So fund_id and fund_date would be the PK of this table. I don't understand what you mean "having a few million data points just for the date..." If you have 20k funds, a particular date will appear in at most 20k rows -- one for each fund. This is not needless duplication. This is necessary to uniquely identify the value of a particular fund on a particular date. If you added, say, fund_name to the table, that would be needless duplication. We assume the fund name will not change from day to day. Unchanging (static) data about each fund would be contained in a separate table. The field fund_id of this table would then be a FK reference to the static table.
To query the value of the funds on a particular date:
select fund_date as ValueDate, fund_id, fund_value
from fund_value_history
where fund_date = #aDate
and fund_id = #aFund -- to limit to a particular fund
To show the dates a fund increased in value from one day to the next:
select h1.fund_date, h2.fund_value as PreviousValue,
h1.fund_value PresentValue
from fund_value_history h1
join fund_value_history h2
on h2.fund_id = h1.fund_id
and h2.fund_date =(
select max( fund_date )
from fund_value_history
where fund_id = h2.fund_id
and fund_date < h2.fund_date )
where h2.fund_value < h1.fund_value
and fund_id = #aFund;
This would be a sizable result set but you could modify the WHERE clause to show, for example, all funds whose values on a particular date was greater than the previous day, or the values of all funds (or particular fund) on a particular date and the previous day, or any number of interesting results.
You could then join to the static table to add fund name or any other descriptive data.
The three column approach you considered is the correct one. There would be no wasted space due to missing values, and you can add and remove funds at any time.
Have a search for "database normalisation", which is the discipline that covers this sort of design decision.
Edit: I should add that you're free to include other metrics in that table, of course. Since historical data is effectively static you can also store "change since previous day" as well, which is redundant strictly speaking, but may help to optimise certain queries such as "show me all the funds that decreased in value on this day".
Related
Database Design of Customer Tax Report
I need to maintain a set of info about Customers and their expenses during the year.
FaceValue, Rent, UtilitiesCost are the tables which stores expenses during the year.
The last column stores Sum of the year. Is that a good practice?
CustomerTaxReport table needs all the total value of each table against CustomerNo.
Storing information like this is difficult to maintain. If there's a modification in one record, then I have to re-calculate all required records.
What is the best way to overcome this problem? If I use Triggers then, I think it will cause performance issues.
Here's what I would recomend to change:
Move Month values from tables' columns to rows. For this just add some month and fill it with the start of corresponding month: '2020-03-01' for the March. Add value column to store amount
If you need a year as a separate period - jus add a new column and fill either month or year
Make CustomerTaxReport query instead of a table
It is most likely int is not a good type for storing amounts of money.
Context: I have three tables.
Table #1: Tbl_TraumaCodes: It marks all the dates, times, and hospital beds where a medical team is alerted to go treat a patient with a serious traumatic injury.
Table #2: Tbl_Location: Lists the date, time, and location (area of the hospital, bed number) where a patient was with an identifying number.
Table #3: Tbl_Outcomes: Has an identifying number paired with discharge outcomes.
Task: I need to with a reasonable amount of surety, match records in Tbl_TraumaCodes with Tbl_Outcomes.
Matching Tbl_Location and Tbl_Outcomes is easy and automatic through a matching query using the identifying number. Matching Tbl_Location records with Tbl_Trauma Codes will create the link I need.
I designed a look-up table in Tbl_Location where the date, time, and location of records from Tbl_ TraumaCode appears so that I can match them. However, the times that are supposed to correspond between Table_Location and Table_TraumaCode are not exactly the same. The times are roughly within the same ballpark (usually 30 +/- min).
Problem: I have thousands of records to match. There may only be 10 records on a given day, which allows me to limit the options when I type in, say, July 1st in the look-up table. Not every item in Tbl_Location with have a matching item in Tbl_TraumaCode. That means I have to match 10 records when there may be 40 extra record to work with. It’s incorrect to assign an item (time) in Table_TraumaCode to more than one item in the Table_Location. My goal is to reduce the potential for human error.
Is there a way to make the records from the look-up table that are already assigned to a record within Tbl_Location NOT display in the look-up field? I thought about drawing the look-up table from a query, but I don’t know how I would create a TraumaCode query that only displays records that aren’t matched in another table. I also don't know if it would impact the previously assigned records.
I avail myself of the collective wisdom and humbly thank you.
Suppose i have a simple table with this columns:
| id | user_id | order_id |
About 1,000,000 rows is inserted to this table per month and as it is clear relation between user_id and order_id is 1 to M.
The records in the last month needed for accounting issues and the others is just for showing order histories to the users.To archive records before last past month,i have two options in my mind:
first,create a similar table and each month copy old records to it.so it will get bigger and bigger each month according to growth of orders.
second,create a table like below:
| id | user_id | order_idssss |
and each month, for each row to be inserted to this table,if there exist user_id,just update order_ids and add new order_id to the end of order_ids.
in this solution number of rows in the table will be get bigger according to user growth ratio.
suppose for each solution we have an index on user_id.
.
Now question is which one is more optimized for SELECT all order_ids per user in case of load on server.
the first one has much more records than the second one,but in the second one some programming language is needed to split order_ids.
The first choice is the better choice from among the two you have shown. With respect, I should say your second choice is a terrible idea.
MySQL (with all SQL dbms systems) is excellent at handling very large numbers of rows of uniformly laid out (that is, normalized) data.
But, your best choice is to do nothing except create appropriate indexes to make it easy to look up order history by date or by user. Leave all your data in this table and optimize lookup instead.
Until this table contains at least fifty million rows (at least four years' worth of data), the time you spend reprogramming your system to allow it to be split into a current and an archive version will be far more costly than just keeping it together.
If you want help figuring out which indexes you need, you should ask another question showing your queries. It's not clear from this question how you look up orders by date.
In a 1:many relationship, don't make an extra table. Instead have the user_id be a column in the Orders table. Furthermore, this is likely to help performance:
PRIMARY KEY(user_id, order_id),
INDEX(order_id)
Is a "month" a calendar month? Or "30 days ago until now"?
If it is a calendar month, consider PARTITION BY RANGE(TO_DAYS(datetime)) and have an ever-increasing list of monthly partitions. However, do not create future months in advance; create them just before they are needed. More details: http://mysql.rjweb.org/doc.php/partitionmaint
Note: This would require adding datetime to the end of the PK.
At 4 years' worth of data (48 partitions), it will be time to rethink things. (I recommend not going much beyond that number of partitions.)
Read about "transportable tablespaces". This may become part of your "archiving" process.
Use InnoDB.
With that partitioning, either of these becomes reasonably efficient:
WHERE user_id = 123
AND datetime > CURDATE() - INTERVAL 30 DAY
WHERE user_id = 123
AND datetime >= '2017-11-01' -- or whichever start-of-month you need
Each of the above will hit at most one non-empty partition more than the number of months desired.
If you want to discuss this more, please provide SHOW CREATE TABLE (in any variation), plus some of the important SELECTs.
So I've been dreading asking this question - mostly because I'm terrible at logic in excel, and transferring logic statements to SQL is such a struggle for me, but I'm going to try and make this as clear as possible.
I have two tables. One table is historic_events and the other is future_events. Based on future_events, I have another table confidence_interval that calculates a z-score telling me based on how many future_events will occur, how many historic_event data points I will need to calculate a reliable average. Each record in historic_events has a unique key called event_id. Each record in confidence_interval has a field called service_id that is unique. The 'service_id' field also exists in 'historic_events' and they can be joined on that field.
So, with all that being said, based on the count of future events by service_id, my confidence_interval table calculates the z-score. I then need to select records from the historic_events table for each service_id that satisfy the following parameters
Select * EVENT_ID
From historic_events
where END_DATE is within two calendar years from todays date
and count of `EVENT_ID` is >= `confidence_interval.Z_SCORE`
if those parameters are not met, then I want to widen the date value to being within three years.
if those parameters are still not met, I want to widen the date value to being within four years, and then again to five years. If there still aren't enough datapoints after five years, ohwell, we'll settle for what we have. We do not want to look at datapoints that are older than five years.
I want my end result to be a table that has a list of the EVENT_ID and I would re-run the SQL query for each service_id.
I hope this makes sense - I can figure out the SELECT and FROM, but totally getting stuck on the WHERE.
I have an order table that contains dates and amounts for each order, this table is big and contains more that 1000000 records and growing.
We need to create a set of queries to calculate certain milestones, is there a way in mysql to figure out on which date we reached an aggregate milestone of x amount.
For e.g we crossed 1 m sales on '2011-01-01'
Currently we scan the entire table then use the logic in PHP to figure out the date, but it would be great if this could be done in mysql without reading so many records at 1 time.
There maybe elegant approaches, but what you can do is maintain a row in another table which contains, current_sales and date it occurred. Every time you have a sale, increment the value, and store sales date. If the expected milestones(1 Million, 2 Million etc) are known in advance, you can store them away when they occur(in same or different table)
i think using gunner's logic with trigger will be a good option as it reduce your efforts to maintain the row and after that you can send mail notification through trigger to know the milestone status