Periodic snapshot fact table - Possibly missing some captures - MySQL

I am tracking employee changes daily in a DimPerson dimension table, and populating my fact table at each month end with counts of Hires, Exits, and Headcount.
For this example, let's say I will be populating the fact table end-of-month April 30th. Now here's the problem I am facing:
I have an employee record on April 17th that's a "Hire" action, so at that point in time my DimPerson table reads like this:
+-------+-----------+----------+--------+--------------------+-------+
| EmpNo | Firstname | LastName | Action | EffectiveStartDate | isCur |
+-------+-----------+----------+--------+--------------------+-------+
| 4590  | John      | Smith    | Hire   | 4/17/2017          | Y     |
+-------+-----------+----------+--------+--------------------+-------+
Now 2 days later, I see the same employee but with an action "Manager Change", so now my DimPerson table becomes this:
+-------+-----------+----------+-----------------+--------------------+-------+
| EmpNo | Firstname | LastName | Action          | EffectiveStartDate | isCur |
+-------+-----------+----------+-----------------+--------------------+-------+
| 4590  | John      | Smith    | Hire            | 4/17/2017          | N     |
| 4590  | John      | Smith    | Manager Change  | 4/19/2017          | Y     |
+-------+-----------+----------+-----------------+--------------------+-------+
So at month end, when I select all "Current" employees, I will miss the Hire capture for this person, since his most recent record is just a manager change and the actual hiring happened "in-month".
Is it normal that you can miss certain changes when doing a periodic snapshot? What do you recommend I do to capture the Hire action in this case?

Sounds like you need to fill your fact table differently: you need a reliable source for the numbers of hires, exits, and headcount. You could pick those events up directly from the source system if available, or pick them up from your dimension table (if it were guaranteed to contain all the history, and not just end-of-day changes).
The source system would be the best solution, but if the dimension table overall shows the history you need, then rather than selecting the isCur people and seeing their most recent action, you need to get all the dimension table records for the period you are snapshotting, and count the actions of each type.
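For instance, a minimal sketch of that counting query, assuming the table and column names from the question, an April 2017 snapshot, and a hypothetical 'Exit' action value:

-- Count each action type recorded during the snapshot month,
-- not just the action on the row currently flagged isCur.
SELECT
    SUM(Action = 'Hire') AS Hires,
    SUM(Action = 'Exit') AS Exits
FROM DimPerson
WHERE EffectiveStartDate >= '2017-04-01'
  AND EffectiveStartDate <  '2017-05-01';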
However I would not recommend you use the dimension table at all to capture transactional history. SCDs on a dimension should be used to track changes to the dimension attributes themselves, not to track the history of actions on the person. Ideally, you would create a transactional fact table to record these actions. That way, you have a transactional fact that records all actions, and you can use that fact table to populate your periodic snapshot at the end of each month, and your dimension table doesn't need to worry about it. Think of your dimension table as a record of the person, not of the actions on the person.
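A rough sketch of that arrangement, with hypothetical table and column names (FactPersonAction, FactMonthlyHeadcount, and their columns are all illustrative):

-- Transactional fact: one row per action on a person.
CREATE TABLE FactPersonAction (
    PersonKey  INT NOT NULL,         -- surrogate key into DimPerson
    ActionDate DATE NOT NULL,
    Action     VARCHAR(20) NOT NULL  -- 'Hire', 'Exit', 'Manager Change', ...
);

-- Month-end load of the periodic snapshot from the transactional fact.
-- (Headcount would be derived separately, e.g. from the dimension's current rows.)
INSERT INTO FactMonthlyHeadcount (MonthKey, Hires, Exits)
SELECT 201704,
       SUM(Action = 'Hire'),
       SUM(Action = 'Exit')
FROM FactPersonAction
WHERE ActionDate >= '2017-04-01'
  AND ActionDate <  '2017-05-01';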

If your fact is intended to show the organizational change at the month end, I would say it is working as designed. The employee has a manager at the end of the month, but did not exist at the end of the previous month. This implies the employee was hired during the month. With a monthly grain, it should not be expected to show the daily activity.
Our employee dimension contains the hire date as a Type 1 attribute. We also include hire date in certain fact tables to allow a role-playing relationship with the date dimension.
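As a sketch of such a role-playing join, with illustrative names (FactHeadcount, SnapshotDateKey, and HireDateKey are assumptions): the same date dimension joins twice under different aliases.

-- DimDate plays two roles: snapshot date and hire date.
SELECT f.EmpNo,
       snap.CalendarDate AS SnapshotDate,
       hire.CalendarDate AS HireDate
FROM FactHeadcount f
JOIN DimDate snap ON snap.DateKey = f.SnapshotDateKey
JOIN DimDate hire ON hire.DateKey = f.HireDateKey;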

Related

Stock portfolio database design to support stock split/merge

I am currently creating a web application to manage my stock portfolio, but when it comes to the transaction table, I have some problems I want to ask about.
The following is my stock transaction table design:
| column name  | datatype           | notes                                   |
|--------------|--------------------|-----------------------------------------|
| id           | int(10)            | primary key, auto increment             |
| portfolio_id | int(10)            | reference to portfolio table primary key|
| symbol       | varchar(20)        | stock symbol, e.g. YHOO, GOOG           |
| type         | ENUM('buy','sell') |                                         |
| tx_date      | DATE               |                                         |
| price        | DOUBLE(15,2)       |                                         |
| volume       | int(20)            |                                         |
| commission   | DOUBLE(15,2)       |                                         |
| created_at   | TIMESTAMP          |                                         |
| updated_at   | TIMESTAMP          |                                         |
In my current design, I don't have an extra table for storing the stock symbol. I generate a list of stock symbols (using some stock API) for the user to pick from when they create a new transaction record, and I think that this approach may cause some problems when there is a stock split/merge, because I may not be able to retrieve the stock price again using the same symbol.
I would like to know how I should modify my table in order to support the stock split/merge case.
Stock splits
... symbol  type   shares ...
... AAPL    split  100    ...
A 2-for-1 split: 100 shares became 200 shares.
Dividends
symbol  type  amount
AAPL    div   20.00
Mergers
Workaround: Record a merger as a sale of the old stock and a buy of the new stock. Add appropriate notes in the 'notes' column.
A more accurate (but more complicated) strategy is to redesign the entire database so that each trade is literally a trade of one transaction for another. A 'buy' trades cash for stock. A 'sell' trades stock for cash. A merger trades stock A for stock B. A split trades 0 shares for 100 shares, etc. Cash is just another asset class.
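A minimal sketch of that double-entry shape, under assumed names (trade, trade_leg); a split is recorded here as giving up the old position and receiving the new one:

-- One row per trade; each trade has two legs.
CREATE TABLE trade (
    trade_id INT PRIMARY KEY AUTO_INCREMENT,
    tx_date  DATE NOT NULL,
    notes    VARCHAR(255)
);

CREATE TABLE trade_leg (
    trade_id INT NOT NULL,           -- references trade
    asset    VARCHAR(20) NOT NULL,   -- 'AAPL', 'USD', ... (cash is just another asset)
    quantity DECIMAL(20,4) NOT NULL  -- negative = given up, positive = received
);

-- A 2-for-1 AAPL split of 100 shares: give up 100 old shares, receive 200 new ones.
INSERT INTO trade (tx_date, notes) VALUES ('2017-04-17', 'AAPL 2-for-1 split');
INSERT INTO trade_leg VALUES (LAST_INSERT_ID(), 'AAPL', -100),
                             (LAST_INSERT_ID(), 'AAPL',  200);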
Foreign stocks
All the major finance sites have this figured out. symbol.exchange is a unique id. No need to reinvent the wheel and create a new id column.
You will also need to add a currency column for foreign stocks.
There are fewer than 4000 stocks in the USA. Why don't you use the stock symbol as the primary key? How do you plan for dividends?
I like your approach of having your own custom security (stock) ID. You can then map this to various ticker/CUSIP/ISIN changes over time from the exchange/data provider. So have a security_master table which has your security_ID, a separate <data_provider>_security table with the one-to-many mappings, and a third security events table (splits, mergers, etc.).
Your transaction, holding, and any other tables that refer to securities will only refer to your internal security ID.
If a stock splits, you still refer to it using the same security_id, but it would map to a security events table that tracks over time, and you would query the appropriate quantity based on the split ratio for that point in time.
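A minimal sketch of those three tables, with hypothetical names and columns:

-- Your own stable identifier for a security.
CREATE TABLE security_master (
    security_id INT PRIMARY KEY AUTO_INCREMENT,
    name        VARCHAR(100) NOT NULL
);

-- One security can map to many provider symbols over time.
CREATE TABLE provider_security (
    security_id INT NOT NULL,
    provider    VARCHAR(50) NOT NULL,
    symbol      VARCHAR(20) NOT NULL,
    valid_from  DATE NOT NULL,
    valid_to    DATE NULL             -- NULL = still current
);

-- Corporate actions: splits, mergers, symbol changes, ...
CREATE TABLE security_event (
    security_id INT NOT NULL,
    event_type  VARCHAR(20) NOT NULL, -- 'split', 'merger', ...
    event_date  DATE NOT NULL,
    ratio       DECIMAL(10,4) NULL    -- e.g. 2.0 for a 2-for-1 split
);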

Better database basis for a table with multiple date fields

What do you think is the better basis, in the sense of "easier to use" with SQL syntax - the first or the second table?
Please give reasons.
table one:
+----+------------+------------+------------+
| id | date1      | date2      | date3      |
+----+------------+------------+------------+
| 1  | 2014-02-15 | 2014-03-24 | 2014-03-24 |
| 2  | NULL       | NULL       | 2014-08-15 |
| 3  | 2014-06-13 | NULL       | NULL       |
| 4  | 2014-01-10 | 2014-09-14 | 2014-01-12 |
+----+------------+------------+------------+
table two:
+----+------------+-------+-------+-------+
| id | date       | one   | two   | three |
+----+------------+-------+-------+-------+
| 1  | 2015-07-04 | true  | true  | false |
| 2  | 2014-06-13 | false | true  | false |
| 3  | 2014-11-11 | true  | false | false |
| 4  | 2017-03-02 | false | true  | true  |
+----+------------+-------+-------+-------+
(content of tables doesn't match in this example)
I just want to know if it is easier to deal with one date field plus additional boolean fields instead of multiple date fields. For example, if you want to have SELECTs like this
That depends on what the dates are.
Just because two fields are both dates tells us nothing about what they have to do with each other, if anything.
If the three dates are totally unrelated and would never be interchangeable in processing, and if they are a fixed set that is not likely to change frequently, like "birth date", "hire date", and "next annual review date", then I would just make them three separate fields. Then when you write queries it would be very straightforward, like
select employee_id, name from employee where next_annual_review_date='2015-02-01'
On the other hand, if you might quite reasonably write a query that would search all three dates, then it makes sense to break the dates out into another table, with a field that identifies the specific date.

For example, I once created a table for a warehouse system where there were many dates associated with a stock item -- the date it arrived in the warehouse, the date it was sold, inventoried, returned to the warehouse (because the customer returned it, for example), re-sold, lost, damaged, repaired, etc. These dates could come in many possible orders, and many of them could occur multiple times: an item might be damaged, repaired, and then damaged and repaired again, or it could be sold, returned, sold again, and returned again, etc.

So I created a table for the stock item with the "static" info like part number, description, and the bazillion codes that the user needed to describe the item, and then a separate "stock event" table with the stock item id, event code, the date, and various other stuff. Then there was another table that listed the event codes with descriptions.
This made it easy to construct queries like, "List everything that has happened to this item in the past four years in date order", or "list all items added to the inventory in November", etc.
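A sketch of that shape, with made-up names; the code table carries the descriptions, and the event table holds one row per thing that happened to an item:

create table stock_event_code (
    event_code  char(1) primary key,  -- 'A' arrived, 'S' sold, 'R' returned, ...
    description varchar(50) not null
);

create table stock_event (
    stock_item_id int not null,       -- references the stock item table
    event_code    char(1) not null,
    event_date    date not null
);

-- "Everything that has happened to this item, in date order":
select ev.event_date, c.description
from stock_event ev
join stock_event_code c on c.event_code = ev.event_code
where ev.stock_item_id = 12345
order by ev.event_date;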
Your second table seems like an all-around bad idea. I can't think of any advantage to having 3 Boolean fields rather than one field that says what it is. Suppose the three dates are birth date, hire date, and next review date. You could create codes for these -- maybe 1, 2, 3; maybe B, H, R; whatever. Then selecting on a specific event is easy enough either way, I guess: select date where hire = true versus select date where event = 'H'.
But listing multiple dates with a description is much easier with a code. You just need a table of codes and descriptions, and then you write
select e.employee_name, v.event_code, ev.date
from employee e
join employee_event ev on ev.employee_id = e.employee_id
join event v on v.event_id = ev.event_id
where ... whatever ...
But with the Booleans, you'd need a three-way case/when.
What happens when new event types are added? With an event code, it's just a data change: add a new record to the event code table. With the Booleans, you need to change the database.
You create the potential for ambiguous data. What happens if two of the Booleans are true, or if none of them are true? What does that mean? There's a whole category of error that can't possibly happen with event codes.
Neither of those is normalized. Normalization is a good way to avoid data anomalies and keep things DRY.
What do your dates represent? What do "one", "two", and "three" represent?
I would go with something like this:
create table my_table (
    my_table_id int primary key,
    a_more_descriptive_word_than_date date not null,
    label text not null
);
The data would look like this:
id  date        label
1   2014-12-23  one
2   2014-12-24  two
3   2014-12-25  three

Advice on avoiding duplicate inserts when the data is repetitive and I don't have a timestamp

Details
This is a rather weird scenario. I'm trying to store records of sales from a service that I have no control over. I am just visiting a URL and storing the JSON it returns. It returns the last 25 sales of an item, sorted by cost, and it appears that the values stay there for a maximum of 10 hours. The biggest problem is that these values don't have timestamps, so I can't very accurately infer how long items have been on the list or whether they are duplicates.
Example:
Say I check this URL at 1pm and I get these results:
+--------+----------+-------+
| Seller | Category | Price |
+--------+----------+-------+
| Joe    | A        | 1000  |
| Mike   | A        | 1500  |
| Sue    | B        | 2000  |
+--------+----------+-------+
At 2pm I get the values and they are:
+--------+----------+-------+
| Seller | Category | Price |
+--------+----------+-------+
| Joe    | A        | 1000  |
| Sue    | B        | 2000  |
+--------+----------+-------+
This would imply that Mike's sale was over 10 hours ago and the value timed out.
At 3pm:
+--------+----------+-------+
| Seller | Category | Price |
+--------+----------+-------+
| Joe    | A        | 1000  |
| Joe    | A        | 1000  |
| Sue    | B        | 2000  |
+--------+----------+-------+
This implies that Joe made 1 sale of $1000 sometime in the past 10 hours, but has also made another sale at the same price since we last checked.
My Goal:
I'd like to be able to store each unique sale in the database once, but allow multiple sales if they do occur (I'm OK with allowing only 1 sale per day if the original plan is too complicated). I realize that without a timestamp, and with the potential of 25+ sales causing a value to disappear early, the results aren't going to be 100% accurate, but I'd like to get at least an approximate idea of the sales occurring.
What I've done so far:
So far, I've made a table that has the aforementioned columns as well as a DATETIME of when I insert a row into my db, plus my own string version of the day it was inserted (YYYYMMDD). I made the combo of the Seller, Category, Price, and my YYYYMMDD date my primary key. I contemplated just searching for entries less than 10 hours old prior to insert, but I'm doing this operation on about 50k entries per hour, so I'm afraid of that being too much of a load for the system (I don't know, however; MySQL is not my forte). What I'm currently doing is applying the rule that I'm OK with recording only 1 sale per day (enforced by my PK being the combo of the values I mentioned above), but I discovered that a sale made at 10pm will end up with a duplicate added the next day at 1am, because the value hasn't timed out yet and it's considered unique once again since the date has changed.
What would you do?
I'd love any ideas on how you'd go about achieving something like this. I'm open to all suggestions, and I'm OK if the solution results in a seller only having 1 unique sale per day.
Thanks a lot, folks. I've been staring this problem down for a week now and I feel it's time to get some fresh eyes on it. Any comments are appreciated!
Update - While toying with the thought that I basically want to disable entries for a given pseudo-PK (seller-category-price) for 10 hours each time, it occurred to me: what if I had a two-stage insert process? Any time I got unique values, I could put them in a stage-one table that stores the data plus a timestamp of entry. If a duplicate tries to get inserted, I just ignore it. After 10 hours, I move those values from the stage-one table to my final values table, thus re-allowing entry of a duplicate sale after 10 hours. I think this would even allow multiple sales with overlapping time, with just a bit of a delay. Say sales occurred at 1pm and 6pm: the 1pm entry would sit in the first-stage table until 11pm, and once it got moved, the 6pm entry would be recorded, just 5 hours late (unfortunately the value would also end up with an insert date that is 5 hours off, which could push a sale to the next day, but I'm OK with that). This avoids the big issue I feared of querying the db for duplicates on every insert. The only thing it complicates is live viewing of the data, but I think querying from 2 different tables shouldn't be too bad. What do you guys and gals think? See any flaws in this method?
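A sketch of that two-stage flow, under assumed table names (sale_stage, sale_final); the primary key makes INSERT IGNORE drop duplicates within the window, and a periodic job promotes rows older than 10 hours:

CREATE TABLE sale_stage (
    seller     VARCHAR(50) NOT NULL,
    category   VARCHAR(10) NOT NULL,
    price      INT NOT NULL,
    first_seen DATETIME NOT NULL,
    PRIMARY KEY (seller, category, price)
);

-- Stage 1: duplicates within the 10-hour window are silently ignored.
INSERT IGNORE INTO sale_stage (seller, category, price, first_seen)
VALUES ('Joe', 'A', 1000, NOW());

-- Stage 2 (run periodically): promote rows older than 10 hours,
-- re-opening the slot for a genuinely new duplicate sale.
INSERT INTO sale_final (seller, category, price, first_seen)
SELECT seller, category, price, first_seen
FROM sale_stage
WHERE first_seen < NOW() - INTERVAL 10 HOUR;

DELETE FROM sale_stage
WHERE first_seen < NOW() - INTERVAL 10 HOUR;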
The problem is less about how to store the data than how to recognize which records are distinct in the first place (despite the fact there is no timestamp or transaction ID to distinguish them). If you can distinguish logically distinct records, then you can create a distinct synthetic ID or timestamp, or do whatever you prefer to store the data.
The approach I would recommend is to sample the URL frequently. If you can consistently harvest the data considerably faster than it is updated, you will be able to determine which records have been observed before by noting the sequence of records that precede them.
Assuming the fields in each record have some variability, it would be very improbable for the same sequence of 5 or 10 or 15 records to occur twice in a 10-hour period. So as long as you sample the data quickly enough that only a fraction of the 25 records roll over each time, your conclusion will be very confident. This is similar to how DNA is sequenced in a "shotgun" algorithm.
You can determine how frequent the samples need to be by taking samples and measuring how often you fail to see enough prior records -- then dial the sample frequency up or down.

MySQL database organization for stocks

So I am creating a web service to predict future stock prices based on historical data for each stock, and I need to store the following information in a database:
Stock information: company name, ticker symbol, predicted price
For each tracked stock: historical data including daily high, daily low, closing price, etc. for every day dating back 1-5 years.
User information: username, password, email, phone number, (the usual)
User tracked stocks: users can pick and choose stocks, for which they will later be alerted with predictions via email or phone.
The set of stocks that predictions will be made on is not predefined, so there should be a quick way to add and remove stocks, and consequently add/remove all data (as stated above) connected to them. My approach to the design is the following:
Table: Stocks
+-----+-----------+----------+------------+----------+-------------+
| ID  | Company   | ticker   | industry   | Sector   | Prediction  |
+-----+-----------+----------+------------+----------+-------------+
Table: HistoricalPrices
+-------------------------------------+--------+--------+-------+----------+
| StockID(using stock ID from above) | Date   | High   | Low   | Closing  |
+-------------------------------------+--------+--------+-------+----------+
Table: Users
+-----+------------+------------+---------------+
| ID  | Username   | Password   | PhoneNumber   |
+-----+------------+------------+---------------+
Table: TrackedStock
+---------+----------+
| UserID  | StockID  |
+---------+----------+
Is there a better way of organizing this? As far as queries are concerned, the majority will be done on the historical data, one stock at a time. (Please set aside security issues such as salting and hashing passwords, as the purpose of the question is the organization.)
Simply said: no, though you may want to add a volume column to the historical prices.
You may also want a market table, and lookup tables for industry, sector, and possibly prediction - though the prediction should probably live in a separate table with a date, so you can look back at past predictions.
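A rough sketch of those additions, with illustrative names; the prediction moves to its own dated table so past predictions remain queryable:

-- Lookup tables referenced from Stocks by id.
CREATE TABLE industry (
    industry_id INT PRIMARY KEY AUTO_INCREMENT,
    name        VARCHAR(100) NOT NULL
);

CREATE TABLE sector (
    sector_id INT PRIMARY KEY AUTO_INCREMENT,
    name      VARCHAR(100) NOT NULL
);

-- One row per prediction, so history is kept.
CREATE TABLE prediction (
    stock_id        INT NOT NULL,    -- references Stocks.ID
    predicted_on    DATE NOT NULL,
    predicted_price DECIMAL(15,2) NOT NULL,
    PRIMARY KEY (stock_id, predicted_on)
);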

Advantages of a lookup table with INTs over decimals in MySQL records?

Trying to summarize in as few words as possible:
I am trying to create a system that tracks the various products an individual can sell and the commission percentage they earn on that particular item. I am thinking about creating reference integers for each product, called "levels", which relate to a commission percentage in a new lookup table, instead of a single inline value. Is this overkill, though, or are there benefits over just placing the percentage inline in each record?
My gut tells me there are advantages to design 1 below, but I'm not sure what they are the more I think about it. If I need to update all individuals selling product X at level Y, indexes and replaces make that easy and fast in both methods. With design 2, I can dynamically change any "earn" to whatever percentage I come up with (0.58988439) for a product, whereas in design 1 I would have to create a new "level".
Note: the product does not relate to the earn directly (one sales rep can earn 50% for the same product on which another sales rep earns only 40%).
Reference Examples:
Design 1 - two tables
table 1
ID | seller_id | product_id | level
-----------------------------------
1  | 11111     | 123A       | 2
2  | 11111     | 15J1       | 6
3  | 22222     | 123A       | 3
table 2
ID | level | earn
-----------------
1  | 1     | .60
2  | 2     | .55
3  | 3     | .50
4  | 4     | .45
5  | 5     | .40
6  | 6     | .35
Design 2 - one table
ID | seller_id | product_id | earn
----------------------------------
1  | 11111     | 123A       | .55
2  | 11111     | 15J1       | .35
3  | 22222     | 123A       | .45
(where earn is a decimal commission percentage)
Update 1 - 7/9/13
It should also be noted that a rep's commission level can change at any given time. For this, we have planned on simply using status, start, and end dates with ranges for eligible commission levels / earn. For example, a rep may earn at Level 2 (or 55%) from Jan 1 to Feb 1. This would be noted in both designs above. Then, to find what level or percentage a rep was earning at any given time: select * from table where (... agent information) AND start <= :date AND (end > :date OR end IS NULL)
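Cleaned up into runnable form, under assumed table and column names (seller_product_level, commission_level), that lookup could read:

-- Earn rate in effect for one seller/product on a given date.
SELECT t.seller_id, t.product_id, l.earn
FROM seller_product_level t
JOIN commission_level l ON l.level = t.level
WHERE t.seller_id = 11111
  AND t.product_id = '123A'
  AND t.start_date <= :date
  AND (t.end_date > :date OR t.end_date IS NULL);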
Does level mean anything to the business?
For instance, I could imagine a situation where the levels are the unit of management. Perhaps there is a rush for sales one quarter, and the rates for each level change. Or, is there reporting by level? In these situations it would make sense to have a separate "level" table.
Another situation would be different levels for different prices of the product -- perhaps the more you sell it for, the higher the commission. Or, the commissions could be based on thresholds, so someone who has sold enough this year suddenly gets a higher commission.
In other words, there could be lots of rules around commission that go beyond the raw percentage. In that case, a "rule" table would be a necessary part of the data model (and "levels" are a particular type of rule).
On the other hand, if you don't have any such rules and the commission is always based on the person and product, then storing the percentage in the table makes a lot of sense. It is simple and understandable. It also has good performance when accessing the percentage -- which presumably happens much more often than changing it.
First of all, using id values to reference a lookup table has nothing to do with normalization per se. Your design #2 shown above is just as normalized. Lots of people have this misunderstanding about normalization.
One advantage to using a lookup table (design #1) is that you can change what is earned by level 6 (for example), and by updating one row in the lookup table, you implicitly affect all rows that reference that level.
Whereas in design #2, you would have to update every row to apply the same change. Not only does this mean updating many rows (which has performance implications), but it opens the possibility that you might not execute the correct UPDATE matching all the rows that need updating. So some rows may have the wrong value for what should be the same earning level.
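To make the contrast concrete, with hypothetical table names: in design #1 a rate change is a single-row update, while in design #2 it must touch every affected row:

-- Design #1: change what level 6 earns everywhere, in one statement.
UPDATE commission_level SET earn = 0.30 WHERE level = 6;

-- Design #2: every affected row must be found and updated,
-- and a wrong WHERE clause silently leaves some rows at the old rate.
UPDATE seller_product_earn SET earn = 0.30 WHERE earn = 0.35;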
Again, using a lookup table can be a good idea in many cases, it's just not correct to call it normalization.