So I am creating a web service to predict future stock prices based on historical data for each stock, and I need to store the following information in a database:
Stock information: company name, ticker symbol, predicted price
For each tracked stock: historical data including the daily high, daily low, closing price, etc. for every day going back 1-5 years.
User information: username, password, email, phone number, (the usual)
User-tracked stocks: users can pick and choose stocks whose predictions they will later be alerted to via email or phone.
The set of stocks that predictions will be made on is not predefined, so there should be a quick way to add and remove stocks, and consequently to add/remove all the data (as described above) connected to them. My approach to the design is the following:
Table: Stocks
+-----+-----------+----------+------------+----------+-------------+
| ID | Company | ticker | industry | Sector | Prediction |
+-----+-----------+----------+------------+----------+-------------+
Table: HistoricalPrices
+-------------------------------------+--------+--------+-------+----------+
| StockID(using stock ID from above) | Date | High | Low | Closing |
+-------------------------------------+--------+--------+-------+----------+
Table: Users
+-----+------------+------------+---------------+
| ID | Username | Password | PhoneNumber |
+-----+------------+------------+---------------+
Table: TrackedStock
+---------+----------+
| UserID | StockID |
+---------+----------+
Is there a better way of organizing this? As far as queries are concerned, the majority will be done on the historical data, one stock at a time. (Please set aside security concerns such as salting and hashing passwords; the purpose of the question is organization.)
Simply said: no. Though you may want to add a volume column to the historical prices.
What you may also want is a market table, plus lookup tables for industry and sector. The prediction should probably live in a separate table of its own, with a date, so you can look back at past predictions.
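A rough sketch of what that could look like (MySQL-flavored; all names illustrative, not prescriptive). ON DELETE CASCADE also covers the stated requirement of removing a stock together with all of its connected data:

    CREATE TABLE Sectors (
        ID   INT PRIMARY KEY AUTO_INCREMENT,
        Name VARCHAR(100) NOT NULL UNIQUE
    );

    CREATE TABLE Industries (
        ID   INT PRIMARY KEY AUTO_INCREMENT,
        Name VARCHAR(100) NOT NULL UNIQUE
    );

    CREATE TABLE Stocks (
        ID         INT PRIMARY KEY AUTO_INCREMENT,
        Company    VARCHAR(200) NOT NULL,
        Ticker     VARCHAR(20)  NOT NULL,
        IndustryID INT,
        SectorID   INT,
        FOREIGN KEY (IndustryID) REFERENCES Industries(ID),
        FOREIGN KEY (SectorID)   REFERENCES Sectors(ID)
    );

    -- Predictions in their own dated table, so past predictions
    -- can later be compared with what actually happened.
    CREATE TABLE Predictions (
        StockID        INT  NOT NULL,
        PredictionDate DATE NOT NULL,
        PredictedPrice DECIMAL(15,2) NOT NULL,
        PRIMARY KEY (StockID, PredictionDate),
        FOREIGN KEY (StockID) REFERENCES Stocks(ID) ON DELETE CASCADE
    );

    CREATE TABLE HistoricalPrices (
        StockID INT  NOT NULL,
        Date    DATE NOT NULL,
        High    DECIMAL(15,2),
        Low     DECIMAL(15,2),
        Closing DECIMAL(15,2),
        Volume  BIGINT,  -- the extra column suggested above
        PRIMARY KEY (StockID, Date),
        FOREIGN KEY (StockID) REFERENCES Stocks(ID) ON DELETE CASCADE
    );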
I am tracking employee changes daily in a DimPerson dimension table, and filling my fact table at each month end, counting Hires, Exits, and Headcount.
For this example, let's say I will be populating the fact table end-of-month April 30th. Now here's the problem I am facing:
I have an employee record on April 17th that's a "Hire" action, so at that point in time my DimPerson table reads like this:
+-------+-----------+----------+--------+--------------------+-------+
| EmpNo | Firstname | LastName | Action | EffectiveStartDate | isCur |
+-------+-----------+----------+--------+--------------------+-------+
| 4590 | John | Smith | Hire | 4/17/2017 | Y |
+-------+-----------+----------+--------+--------------------+-------+
Now 2 days later, I see the same employee but with an action "Manager Change", so now my DimPerson table becomes this:
+-------+-----------+----------+-----------------+--------------------+-------+
| EmpNo | Firstname | LastName | Action | EffectiveStartDate | isCur |
+-------+-----------+----------+-----------------+--------------------+-------+
| 4590 | John | Smith | Hire | 4/17/2017 | N |
| 4590 | John | Smith | Manager Change | 4/19/2017 | Y |
+-------+-----------+----------+-----------------+--------------------+-------+
So at month end, when I select all "current" employees, I will miss the Hire capture for this person, since his most recent record is just a manager change and the actual hiring happened "in-month".
Is it normal that you can miss certain changes when doing a periodic snapshot? What do you recommend I do to capture the Hire action in this case?
Sounds like you need to fill up your fact table differently: you need a reliable source for the numbers of hires, exits, and headcount. You could pick those events up directly from the source system if available, or pick them up from your dimension table (if it was guaranteed to contain all the history, and not just end-of-day changes).
The source system would be the best solution, but if the dimension table overall shows the history you need, then rather than selecting the isCur people and seeing their most recent action, you need to get all the dimension table records for the period you are snapshotting, and count the actions of each type.
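For example, assuming the dimension genuinely keeps one row per action (column names as in the example above), April's counts would come from something like:

    -- Count every action type that took effect during April 2017,
    -- ignoring the isCur flag entirely.
    SELECT Action, COUNT(*) AS ActionCount
    FROM DimPerson
    WHERE EffectiveStartDate >= '2017-04-01'
      AND EffectiveStartDate <  '2017-05-01'
    GROUP BY Action;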
However I would not recommend you use the dimension table at all to capture transactional history. SCDs on a dimension should be used to track changes to the dimension attributes themselves, not to track the history of actions on the person. Ideally, you would create a transactional fact table to record these actions. That way, you have a transactional fact that records all actions, and you can use that fact table to populate your periodic snapshot at the end of each month, and your dimension table doesn't need to worry about it. Think of your dimension table as a record of the person, not of the actions on the person.
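A minimal sketch of such a transactional fact table (table and column names are hypothetical):

    CREATE TABLE FactEmployeeAction (
        EmployeeKey INT NOT NULL,          -- surrogate key into DimPerson
        DateKey     INT NOT NULL,          -- key into the date dimension
        ActionType  VARCHAR(30) NOT NULL   -- 'Hire', 'Exit', 'Manager Change', ...
    );

    -- The monthly periodic snapshot then becomes a simple aggregation
    -- (assumes the date dimension exposes a month-end attribute):
    SELECT d.MonthEndDate, f.ActionType, COUNT(*) AS Actions
    FROM FactEmployeeAction f
    JOIN DimDate d ON d.DateKey = f.DateKey
    GROUP BY d.MonthEndDate, f.ActionType;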
If your fact is intended to show the organizational change at the month end, I would say it is working as designed. The employee has a manager at the end of the month, but did not exist at the end of the previous month. This implies the employee was hired during the month. With a monthly grain, it should not be expected to show the daily activity.
Our employee dimension contains the hire date as a Type 1 attribute. We also include the hire date in certain fact tables to allow a role-playing relationship with the date dimension.
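Role-playing here just means joining the same date dimension more than once under different aliases. A hypothetical example, assuming the fact table carries a HireDateKey alongside its event DateKey:

    -- DimDate plays two roles: event date and hire date.
    SELECT f.EmployeeKey,
           evt.CalendarDate   AS EventDate,
           hired.CalendarDate AS HireDate
    FROM FactEmployeeAction f
    JOIN DimDate evt   ON evt.DateKey   = f.DateKey
    JOIN DimDate hired ON hired.DateKey = f.HireDateKey;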
I am currently creating a web application to manage my stock portfolio, but when it comes to the transaction table, I have a problem I want to ask about.
The following is my stock transaction table design:
| column name  | datatype           | notes                                     |
|--------------|--------------------|-------------------------------------------|
| id           | int(10)            | primary key, auto increment               |
| portfolio_id | int(10)            | references portfolio table's primary key  |
| symbol       | varchar(20)        | stock symbol, e.g. YHOO, GOOG             |
| type         | ENUM('buy','sell') |                                           |
| tx_date      | DATE               |                                           |
| price        | DOUBLE(15,2)       |                                           |
| volume       | int(20)            |                                           |
| commission   | DOUBLE(15,2)       |                                           |
| created_at   | TIMESTAMP          |                                           |
| updated_at   | TIMESTAMP          |                                           |
In my current design, I don't have an extra table for storing the stock symbol. I generate a list of stock symbols (using some stock API) for the user to pick from when they create a new transaction record, and I think this approach may cause problems when there is a stock split/merge, because I may not be able to retrieve the stock price again using the same symbol.
I would like to know how I should modify my table, in order to support the stock split/merge case?
Stock splits
| ... | symbol | type  | shares | ... |
| ... | AAPL   | split | 100    | ... |
2 for 1 split; 100 shares became 200 shares.
Dividends
| symbol | type | amount |
| AAPL   | div  | 20.00  |
Mergers
Workaround: record a merger as a sale of the old stock and a buy of the new stock. Add appropriate notes in the 'notes' column.
A more accurate (but more complicated) strategy is to redesign the entire database so that each trade is literally a trade of one transaction for another. A 'buy' trades cash for stock. A 'sell' trades stock for cash. A merger trades stock A for stock B. A split trades 0 shares for 100 shares, etc. Cash is just another asset class.
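A rough sketch of that redesign, where every row trades one asset for another (all names hypothetical):

    CREATE TABLE trade (
        id           INT PRIMARY KEY AUTO_INCREMENT,
        portfolio_id INT  NOT NULL,
        tx_date      DATE NOT NULL,
        give_symbol  VARCHAR(20) NOT NULL,    -- 'USD' for cash, 'AAPL' for stock
        give_amount  DECIMAL(20,4) NOT NULL,
        get_symbol   VARCHAR(20) NOT NULL,
        get_amount   DECIMAL(20,4) NOT NULL,
        notes        VARCHAR(255)
    );

    -- A buy: trade cash for stock.
    INSERT INTO trade (portfolio_id, tx_date, give_symbol, give_amount, get_symbol, get_amount)
    VALUES (1, '2017-04-17', 'USD', 15000.00, 'AAPL', 100);

    -- A 2-for-1 split: trade 0 shares for 100 extra shares.
    INSERT INTO trade (portfolio_id, tx_date, give_symbol, give_amount, get_symbol, get_amount)
    VALUES (1, '2017-06-01', 'AAPL', 0, 'AAPL', 100);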
Foreign stocks
All the major finance sites have this figured out. symbol.exchange is a unique id. No need to reinvent the wheel and create a new id column.
You will also need to add a currency column for foreign stocks.
There are fewer than 4,000 stocks in the USA. Why don't you use the stock symbol as the primary key? How do you plan for dividends?
I like your approach of having your own custom security (stock) ID. You can then map this to the various ticker/CUSIP/ISIN changes over time from the exchange/data provider. So have a security_master table which has your security_id, a separate <data_provider>_security table with the one-to-many mappings, and a third table for security events (splits, mergers, etc.).
Your transaction, holding, and any other tables which refer to securities, will only refer to your internal security ID.
If a stock splits, you still refer to it using the same security_id, but it would map to a security events table that tracks corporate actions over time, and you would query the appropriate quantity based on the split ratio for that point in time.
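Sketched as DDL (all names hypothetical):

    CREATE TABLE security_master (
        security_id INT PRIMARY KEY AUTO_INCREMENT,
        description VARCHAR(200)
    );

    -- One internal security can map to many provider tickers over time.
    CREATE TABLE provider_security (
        security_id INT NOT NULL,
        provider    VARCHAR(50) NOT NULL,   -- e.g. 'yahoo', 'bloomberg'
        ticker      VARCHAR(20) NOT NULL,
        valid_from  DATE NOT NULL,
        valid_to    DATE,                   -- NULL = still current
        FOREIGN KEY (security_id) REFERENCES security_master(security_id)
    );

    CREATE TABLE security_event (
        security_id INT NOT NULL,
        event_date  DATE NOT NULL,
        event_type  VARCHAR(20) NOT NULL,   -- 'split', 'merger', ...
        ratio       DECIMAL(10,4),          -- e.g. 2.0 for a 2-for-1 split
        FOREIGN KEY (security_id) REFERENCES security_master(security_id)
    );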
Concise explanation
There is a row in the database which shows the current state of 'Umbrella', forged from the Model 'Product'.
You want to access the complete history of what you deem to be relevant changes to Umbrella, involving related models, quickly and painlessly.
The problem is that paper_trail doesn't bring in the beef when the events table is tens of thousands of rows long. You can't truncate it, as it contains important history, and its performance is woeful, since it has to parse thousands of lines of YAML to find 'relevant' changes.
Background reading done, still no idea what the problem is called
This seems like something basic to me, but I see no mention of others tackling it beyond using paper_trail, so I don't know what it's commonly referred to as outside that gem, if anything. "Ruby on Rails what-is vs what-was architecture without paper_trail" was the best title I could think of. Am I creating a one-to-many relationship between models and time?
Have read "Design Patterns in Ruby" (2007), which references the Gang of Four's design patterns; no mention of this problem.
Have tried the "paper_trail" gem, but it doesn't quite solve it.
The problem
Assuming you have Products, Companies and Categories, and
Product: id, name, price, barcode, (also company_id and category_id)
Company: id, name, registered_company_number
Category: id, name, some_immutable_field
Company has many Products
Category has many Products
And you need to see the history of each Product, including changes to itself such as price, changes to which company it belongs to, changes to the company's name, and the same for categories, such as:
date  | event      | company name | cmp id | category name | cat id | name     | price
------|------------|--------------|--------|---------------|--------|----------|------
jan11 | created    | megacorp     | 1      | outdoors      | 101    | umbrella | 10
feb11 | cat change | megacorp     | 1      | fashion       | 102    | umbrella | 10
mar11 | cat rename | megacorp     | 1      | vogue         | 102    | umbrella | 10
apr11 | cmp rename | megacorp inc | 1      | vogue         | 102    | umbrella | 10
may11 | cmp change | ultra & sons | 2      | vogue         | 102    | umbrella | 12
jul11 | cmp change | megacorp     | 1      | vogue         | 102    | umbrella | 12
Note that whilst umbrella was with ultra & sons, megacorp inc changed its name back to megacorp, but we don't show that in this history as it's not relevant to this product. (The name change of company 1 happens in jun11, but is not shown.)
This can be accomplished with paper_trail, but the code to do it is either very complex, long, and procedural, or, if written 'elegantly' in the way paper_trail intended, very slow, as it makes many DB calls against what is currently a very bloated events table.
Why paper trail is not the right solution here
Paper trail stores all changes as YAML; the database table is polymorphic and stores data from many different models. This table, and thus this gem, seems suited to identifying who made which changes... but to use it for history the way I need to, it's like a god table that stores all information about what was, and it has too much responsibility.
The history I am after does not care about all changes to an object, only certain fields. (We still need to record all the small changes, just not include them in the history of products, so we can't simply not record these things: paper_trail has its regular duty of identifying who did what, and it cannot be optimised solely for this purpose.) Pulling this information requires getting all records where item_type is Product and item_id is the currently-viewed product_id, parsing the YAML, and checking whether we are interested in the changes (did a field change, and is it a field whose changes we want to see?). Then the same must be done for every category and company the product has been associated with in its lifetime, keeping only the changes which occur in the windows during which the product was associated with that category/company.
Paper trail can be turned off quite easily... so if one of your devs were to disable it in the code somewhere as an optimisation while some operations were run, but forgot to write the code to turn it back on, no history would be recorded. And because paper_trail is more of a man-on-the-loop than a man-in-the-loop, if it's not running you might not notice (and then have to write overly complex code to catch all the possible scenarios with holey data). A solution which enforces the saving of history is required.
Half baked solution
Conceptually, I think the models should be split between that which persists and that which changes. I am surprised this is not something baked into Rails from the ground up, but then there are some issues with it:
Product: id, barcode
Product_period: id, name, price, product_id, start_date, (also company_id and category_id)
Company: id, registered_company_number
Company_period: id, name, company_id, start_date
Category: id, some_immutable_field
Category_period: id, name, category_id, start_date
Every time the price of the product, or the company_id of the product changes, a new row is added to product_period which records the beginning of a new era where the umbrella now costs $11, along with the start_date (well, time) that this auspicious period begins.
Thus in the product model, all calls to things which are immutable (or where we only care about the most recent value) remain as they are; whereas things which change, and which we care about, have methods which to an outside user (or existing code) appear to operate on the product model, but in fact call the most recent product_period for this product and get the latest values there.
This solves the problem superficially, but it's a little long-winded, and it still has the problem that you have to poke around through company_period and category_period selecting relevant entries (i.e. where the company/category experiences a change during a time when the product was associated with it) rather than something more elegant.
At least MySQL will run faster, there is more freedom to create indexes, and there are no longer thousands of YAML parses bogging it down.
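In MySQL terms, the period table and the "latest period" lookup might look like this (a sketch of the idea above, not a complete solution):

    CREATE TABLE product_periods (
        id          INT PRIMARY KEY AUTO_INCREMENT,
        product_id  INT NOT NULL,
        company_id  INT NOT NULL,
        category_id INT NOT NULL,
        name        VARCHAR(100)  NOT NULL,
        price       DECIMAL(10,2) NOT NULL,
        start_date  DATETIME      NOT NULL,
        KEY idx_product_start (product_id, start_date)
    );

    -- The current state of product 42 is simply its most recent period row.
    SELECT *
    FROM product_periods
    WHERE product_id = 42
    ORDER BY start_date DESC
    LIMIT 1;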
On the quest to write more readable code: are these improvements sufficient? What do other people do? Does this have a name? Is there a more elegant solution, or just a quagmire of trade-offs?
There are a bunch of other versioning and history gems for Rails (I contributed to the first one, 10 years ago!); find them here: https://www.ruby-toolbox.com/categories/Active_Record_Versioning
They all have different methods for storing versions, like the one you suggest above, and some are configurable. I also don't agree with the polymorphic god table for all users, but it's not too slow if you have decent indexes.
I have some tables, they are:
user
==============================
user_id | username | etc..
==============================
user_metadata
====================================
user_id | birthday | gender | etc..
====================================
game
========================
game_id | name | etc..
========================
I want to store users liking games, and since the age attribute is important, I need to differentiate when the like was made: it should be a different entity if someone liked a game at the age of 5 than at the age of 8. So which table structure would you recommend?
Option 1: store the age of the user when the like is made:
user_likes
==========================================
user_id | game_id | user_age_when_liking
==========================================
Option 2: store the timestamp:
user_likes
==========================================
user_id | game_id | liked_at (timestamp)
==========================================
So with option number 2, if I need to get all likes made at a certain age, I will calculate the year difference between the user's birthday and liked_at.
Other suggestions are very welcome.
It mostly depends on the purpose of the column. If you never need to do any time-to-time calculations (addition, subtraction, etc.) based on when the like occurred, then the age of the person is completely valid (basically you're pre-evaluating birthDate - likeDate). However, if you have any intention of saying that someone liked a game 10 days ago, it will be much more work to reverse the pre-evaluated user age back to a likeDate to allow another calculation. Keeping in mind that you most likely want to keep your database data extensible, using likedAt is preferred in my opinion.
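With option 2, and assuming MySQL plus the user_metadata.birthday column above, the age at like time stays recoverable on demand:

    -- All likes made at age 8, computed from birthday and liked_at.
    SELECT ul.user_id, ul.game_id, ul.liked_at
    FROM user_likes ul
    JOIN user_metadata um ON um.user_id = ul.user_id
    WHERE TIMESTAMPDIFF(YEAR, um.birthday, ul.liked_at) = 8;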
I would also note that stackoverflow is not a site designed to answer theoretical questions that don't have a specific code answer, this question should probably be on programmers.stackexchange.com.
I am trying to decide which would be the best data warehouse type design. It will be used to find historical price averages of different items during different time periods using a Google-type search. For example, what was the average price of Stock A this month, 3 months, 6 months, and 1 year ago? The issue is that I do not have an item name I can use; I have description fields about the item.
This means that I can't aggregate items into views, since the same item may be listed 20 times, each with a different description, so I have to do a full-text search on the description field on the fly, grab the prices where the insert date is less than 3 months old, then find the average of those.
So is my best bet to have everything in one table like:
MAIN
----------------------------
ID | Description | Price | Date
or many tables:
DESCRIPTION
------------------
ID | Description |
PRICE
---------
ID | PRICE
And just join to get the data I want. The database will contain a few million rows. If I had a way to get the real name of the item I could see pre-aggregating the data, but that is not an option for me. I appreciate any advice!
I'd say option 2... keep the top-level details in the "description" table, and the historic data in the "price" table (albeit with a Date field added to capture the temporal value).
As Joel suggested, option 2 is likely going to provide you more flexibility. I would suggest including additional dates in each table to accommodate slowly changing dimensions. Descriptions and other attributes about a given item may change over time.
In the case of a brick and mortar retailer, you would quite likely include the Store ID as well because items are quite likely priced differently in different locations due to competition and demographic make-up of your customers near a given location.
DESCRIPTION
---------------------------------------------------
ID | Description | Effective Date | Expiration Date
PRICE
-----------------------------------------------------------
ID | Location ID | Price | Effective Date | Expiration Date
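In SQL, that might come together like this (MySQL full-text syntax assumed; names illustrative):

    CREATE TABLE description (
        id              INT PRIMARY KEY AUTO_INCREMENT,
        description     TEXT NOT NULL,
        effective_date  DATE NOT NULL,
        expiration_date DATE,
        FULLTEXT KEY ft_description (description)
    );

    CREATE TABLE price (
        id              INT NOT NULL,   -- references description.id
        location_id     INT,
        price           DECIMAL(10,2) NOT NULL,
        effective_date  DATE NOT NULL,
        expiration_date DATE
    );

    -- Average price over the last 3 months for items matching a search.
    SELECT AVG(p.price) AS avg_price
    FROM description d
    JOIN price p ON p.id = d.id
    WHERE MATCH(d.description) AGAINST ('stock a' IN NATURAL LANGUAGE MODE)
      AND p.effective_date >= CURDATE() - INTERVAL 3 MONTH;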