Database Design for Historical Aggregation - MySQL

I am trying to decide on the best data-warehouse-style design. It will be used to find historical price averages of different items during different time periods using a Google-type search. For example, what was the average price of Stock A this month, 3 months, 6 months, and 1 year ago? The issue is that I do not have an item name that I can use; I only have description fields about the item.
This means that I can't aggregate items into views, since the same item may be listed 20 times, each with a different description. Instead I have to do a full-text search on the description field on the fly, grab the prices where the insert date is less than 3 months old, and then find the average of those.
So is my best bet to have everything in one table like:
MAIN
----------------------------
ID | Description | Price | Date
or many tables:
DESCRIPTION
------------------
ID | Description |
PRICE
---------
ID | PRICE
And just join to get the data I want. The database will contain a few million rows. If I had a way to get the real name of the item I could see pre-aggregating the data, but that is not an option for me. I appreciate any advice!
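For reference, the kind of on-the-fly query I have in mind looks something like this (a sketch assuming the single-table layout above with a FULLTEXT index on Description; InnoDB supports FULLTEXT from MySQL 5.6):

-- Average price over the last 3 months for rows matching the search term
SELECT AVG(Price) AS avg_price
FROM MAIN
WHERE MATCH (Description) AGAINST ('Stock A' IN NATURAL LANGUAGE MODE)
  AND Date >= NOW() - INTERVAL 3 MONTH;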

I'd say option 2 ... keep the top-level details in the "description" table, and the historic data in the "price" table (albeit with a Date field added to capture the temporal value).

As Joel suggested, Option 2 is likely to give you more flexibility. I would suggest including additional dates in each table to accommodate slowly changing dimensions: descriptions and other attributes of a given item may change over time.
In the case of a brick-and-mortar retailer, you would quite likely include the Store ID as well, because items are often priced differently in different locations due to competition and the demographic make-up of your customers near a given location.
DESCRIPTION
---------------------------------------------------
ID | Description | Effective Date | Expiration Date
PRICE
-----------------------------------------------------------
ID | Location ID | Price | Effective Date | Expiration Date
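As a minimal DDL sketch of that layout (column types are assumptions; the FULLTEXT index assumes MySQL 5.6+ on InnoDB):

CREATE TABLE description (
  id          INT UNSIGNED NOT NULL,
  description TEXT         NOT NULL,
  eff_date    DATE         NOT NULL,
  exp_date    DATE         NULL,      -- NULL = still current
  PRIMARY KEY (id, eff_date),
  FULLTEXT KEY ftx_description (description)
) ENGINE=InnoDB;

CREATE TABLE price (
  id          INT UNSIGNED NOT NULL,  -- references description.id
  location_id INT UNSIGNED NOT NULL,
  price       DECIMAL(9,2) NOT NULL,
  eff_date    DATE         NOT NULL,
  exp_date    DATE         NULL,
  PRIMARY KEY (id, location_id, eff_date)
) ENGINE=InnoDB;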

Related

Selecting Available Items Based on Date Range Availability

I know questions of this type have been asked here, but I'm wondering if this scenario is doable (I didn't see other examples of this).
Let's say I have a MySQL DB table that has the following items:
item | type
----------------
1 | Small
2 | Small
3 | Large
4 | Small
And I have an order table where an end-user can "check-out" these items for a date range that he/she specifies (sort of like booking a hotel room):
orderid | item | startdate | enddate
--------------------------------------------
1       | 2,4  | 2015-08-15 | 2015-09-15
Potentially, there can be thousands of items, and anyone can choose to reserve a large number at once if desired. This is why I represent item as a string 2,4 in the order table example above.
Assuming the end-user were to pick a date range within the orderid's date range, how can I do a MySQL query that shows only the items available outside that start/end range, given that the items are represented as a comma-separated string in the order table? Would this even be possible?
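For what it's worth, the usual workaround is to normalize the comma-separated column into one row per ordered item and then apply the standard range-overlap test; a hypothetical sketch (all names assumed):

-- One row per item per order instead of a '2,4' string
CREATE TABLE order_item (
  orderid   INT  NOT NULL,
  item      INT  NOT NULL,
  startdate DATE NOT NULL,
  enddate   DATE NOT NULL,
  PRIMARY KEY (orderid, item)
);

-- Items free for a requested range: two ranges overlap exactly when
-- start1 <= end2 AND start2 <= end1, so exclude items with any overlap.
SELECT i.item
FROM items i
WHERE NOT EXISTS (
  SELECT 1
  FROM order_item oi
  WHERE oi.item = i.item
    AND oi.startdate <= '2015-09-01'  -- requested end date
    AND oi.enddate   >= '2015-08-20'  -- requested start date
);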

Storing data counts from table into a "trending" table

We have a table for which we have to present many counts for different combinations of fields.
This takes quite a while to compute on the fly and doesn't provide historical data, so I'm thinking about the best way to store those counts in another table, with a timestamp, so we can query them quickly and get historical trends.
For each count we need 4 pieces of information to identify it, and there are about 1000 different metrics we would like to store.
I'm considering three different strategies; each stores a count and a timestamp, but they vary in how the count is identified for retrieval.
1. One table with 4 fields to identify the count; the 4 fields wouldn't be normalized, as they contain data from different external tables.
2. One table with a single "tag" field containing the 4 pieces of information as a tag. These tags could be enriched and kept in another table, perhaps with a field for each tag part, linking them to the external tables.
3. Different tables for the different groups of counts, to allow normalizing on one or more fields; but this would need anywhere from 6 to tens of tables.
I'm going with the first one, not normalized at all, but I'm wondering if anyone has a better or simpler way to store all these counts.
Sample of a value:
status,installed,all,virtual,1234,01/05/2015
The first field, status, can have up to 10 values.
The second field, installed, can have up to 10 values per value of the first field.
The third field, all, can have up to 10 different values, which are the same across all categories.
The fourth field, virtual, can have up to 30 values, also the same across all the previous categories.
The last two fields are the count value and a timestamp.
Thanks,
Isaac
When you have a lot of metrics and you don't need to use them for intra-metric calculations, you can go with solution 1.
I would probably build a table like this:
Status_id | Installed_id | All_id | Virtual_id | Date | Value
Or, if each combination of the first four columns has a proper name, I would probably create two tables (I think this is what you refer to as solution 2):
Metric Table
Status_id | Installed_id | All_id | Virtual_id | Metric_id | Metric_Name
Values Table
Metric_id | Date | Value
This is good if you have names for your metrics or other details that you would otherwise need to duplicate for each combination in the first approach.
In both cases it will be a bit complicated to do intra-row operations across different metrics; for this reason this approach is suggested only for high-level KPIs.
Finally, because all possible combinations of the last two fields are always present in your table, you could consider pivoting them into columns:
Status_id | Installed_id | Date | All1_Virtual1 | All1_Virtual2 | ... | All10_Virtual30
With 10 values for All and 30 for Virtual you would have 300 columns; not very easy to handle, but worth having if you need to compute something like:
(All1_Virtual2 - All5_Virtual23) * All6_Virtual12
But in that case I would prefer (if possible) to do the calculation in advance to reduce the number of columns.
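As a minimal DDL sketch of the two-table variant (names and types are assumptions):

CREATE TABLE metric (
  metric_id    INT UNSIGNED NOT NULL AUTO_INCREMENT,
  status_id    INT UNSIGNED NOT NULL,
  installed_id INT UNSIGNED NOT NULL,
  all_id       INT UNSIGNED NOT NULL,
  virtual_id   INT UNSIGNED NOT NULL,
  metric_name  VARCHAR(100) NOT NULL,
  PRIMARY KEY (metric_id),
  UNIQUE KEY uq_metric_combo (status_id, installed_id, all_id, virtual_id)
) ENGINE=InnoDB;

CREATE TABLE metric_value (
  metric_id INT UNSIGNED NOT NULL,
  ts        DATETIME     NOT NULL,
  value     BIGINT       NOT NULL,
  PRIMARY KEY (metric_id, ts),
  FOREIGN KEY (metric_id) REFERENCES metric (metric_id)
) ENGINE=InnoDB;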

Join items to a new item with many-to-many (n:m)

My business rule is something like a used car/motorbike dealership:
My table "stock" contains vehicles, no two of which are the same product, as each one belongs to a different owner.
Sometimes an owner has two vehicles that he wants to sell separately, but also wants to offer together, e.g.:
Owner has a car and a motorcycle:
+----+------------+
| id | Stock      |
+----+------------+
| 1  | car        |
+----+------------+
| 2  | motorcycle |
+----+------------+
He wants to advertise or sell in two ways: the first would be the car for U$ 10.000 and the motorbike for U$ 5.000.
But he also wants the option to sell both together for a lower price (car + bike, U$ 12.000), e.g.:
+----+-----------+--------------------+-----------+
| id | id_poster | Stock | Price |
+----+-----------+--------------------+-----------+
| 1 | 1 | car | U$ 10.000 |
+----+-----------+--------------------+-----------+
| 2 | 2 | motorcycle | U$ 5.000 |
+----+-----------+--------------------+-----------+
| 1 | 3 | car | U$ 12.000 |
+----+-----------+--------------------+-----------+
| 2 | 3 | motorcycle | U$ 12.000 |
+----+-----------+--------------------+-----------+
Is this the best way to do this?
My structure already does this (in what I believe to be the best way); I'm using foreign keys and an n:m relation. See my structure:
Ok, so if I'm understanding the question right, you're wondering if using a junction table is right. It's still difficult to tell from just your table structures. The poster table just has a price, and the stock table just has a title and description. It's not clear from those fields just what they're supposed to represent or how they're supposed to be used.
If you truly have a many-to-many relationship between stock and poster entities -- that is, a given stock can have 0, 1 or more poster, and a poster can have 0, 1 or more stock -- then you're fine. A junction table is the best way to represent a true many-to-many relationship.
However, I don't understand why you would want to store a price in poster like that. Why would one price need to be associated with multiple titles and descriptions? That would mean if you changed it in one spot that it would change for all related stock. Maybe that's what you want (say, if your site were offering both A1 and A0 size posters, or different paper weights with a single, flat price across the site regardless of the poster produced). However, there just aren't enough fields in your tables currently to see what you're trying to model or accomplish.
So: Is a junction table the best way to model a many-to-many relationship? Yes, absolutely. Are your data entities in a many-to-many relationship? I have no idea. There isn't enough information to be able to tell.
A price, in and of itself, may be one-to-one (each item has one price), one-to-many (each item has multiple prices, such as multiple currencies), or -- if you use a price category or type system like with paper sizes -- then each item has multiple price categories, and each price category applies to multiple items.
So, if you can tell me why a stock has multiple prices, or why a single poster price might apply to multiple stock, then I can tell you if using a junction table is correct in your situation.
Having seen your edit that includes your business rules, this is exactly the correct structure to use. One car can be in many postings, and one posting may have many cars. That's a classic many-to-many, and using a junction table is absolutely correct.
It's not clear how the examples relate to your diagram because you use different terminology, but I think it's safe to say: if you want to store something like "this entity consists of orange, apple and pear", then the DB design you show is the correct way to do it. You'd have one poster entry, three entries in poster_has_stock pointing to the same poster, and three elements in stock.
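To make that concrete, a minimal sketch of the junction-table structure, with column types assumed from the fields mentioned above:

CREATE TABLE poster (
  poster_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  price     DECIMAL(9,2) NOT NULL,
  PRIMARY KEY (poster_id)
) ENGINE=InnoDB;

CREATE TABLE stock (
  stock_id    INT UNSIGNED NOT NULL AUTO_INCREMENT,
  title       VARCHAR(100) NOT NULL,
  description TEXT,
  PRIMARY KEY (stock_id)
) ENGINE=InnoDB;

-- Junction table: one row per (poster, stock) pairing
CREATE TABLE poster_has_stock (
  poster_id INT UNSIGNED NOT NULL,
  stock_id  INT UNSIGNED NOT NULL,
  PRIMARY KEY (poster_id, stock_id),
  FOREIGN KEY (poster_id) REFERENCES poster (poster_id),
  FOREIGN KEY (stock_id)  REFERENCES stock (stock_id)
) ENGINE=InnoDB;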
The structure you're using is the best solution in your case; no need to change it, just two minor tweaks:
1. remove the two indexes fk_poster_has_stock_stock1_idx and fk_poster_has_stock_poster_idx, because those columns are already covered by the primary key
2. the stock_price field should use the DECIMAL data type (more precise for money)
You can read more about the DECIMAL data type in the MySQL documentation.
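For example (assuming the column lives on poster_has_stock; the precision and scale are illustrative):

ALTER TABLE poster_has_stock
  MODIFY stock_price DECIMAL(9,2) NOT NULL;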
I think your solution is nearly perfect. You may want to add an "id" column to the "poster_has_stock" table, and of course change the price type (as noted above).
But you may also consider a second option, with a stock_id in the poster table.
Why?
- There should be no poster with no stock connected to it.
- In most cases the offers will be one stock <=> one poster.
This would still allow you to add as many dependent stocks to a poster as you want.
You can also add poster_special_price DECIMAL(9,2) to the poster table.
This will let you easily show:
- the price for a stock item;
- the special price for a stock item with its dependencies.
This will also be easier to manage in the controller (create, update): you will be adding the poster together with its stock, and no transactions will be needed when adding a new poster.
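A sketch of that variant, reusing the hypothetical poster and stock tables from above:

ALTER TABLE poster
  ADD COLUMN stock_id INT UNSIGNED NOT NULL,          -- the main stock item
  ADD COLUMN poster_special_price DECIMAL(9,2) NULL,  -- price with dependencies
  ADD FOREIGN KEY (stock_id) REFERENCES stock (stock_id);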
You may consider a new table that creates a relationship between the stock items, such as:
stock_component
---------------
parent_stock_id
child_stock_id
child_qty
In this way you can link many children to one parent in the style of a bill of materials; the rest of your links can then continue to reference the stock_id of the appropriate parent.
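A minimal sketch of that table (types assumed; stock_id is the key of the existing stock table):

-- Bill-of-materials style self-relationship between stock items
CREATE TABLE stock_component (
  parent_stock_id INT UNSIGNED NOT NULL,
  child_stock_id  INT UNSIGNED NOT NULL,
  child_qty       INT UNSIGNED NOT NULL DEFAULT 1,
  PRIMARY KEY (parent_stock_id, child_stock_id),
  FOREIGN KEY (parent_stock_id) REFERENCES stock (stock_id),
  FOREIGN KEY (child_stock_id)  REFERENCES stock (stock_id)
) ENGINE=InnoDB;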

Advice on avoiding duplicate inserts when the data is repetitive and I don't have a timestamp

Details
This is a rather weird scenario. I'm trying to store records of sales from a service that I have no control over. I am just visiting a URL and storing the JSON it returns. It returns the last 25 sales of an item, sorted by cost, and it appears that the values stay there for a maximum of 10 hours. The biggest problem is that these values don't have timestamps, so I can't accurately infer how long items have been on the list or whether they are duplicates.
Example:
Say I check this url at 1pm and I get these results
+--------+----------+-------+
| Seller | Category | Price |
+--------+----------+-------+
| Joe | A | 1000 |
| Mike | A | 1500 |
| Sue | B | 2000 |
+--------+----------+-------+
At 2pm I get the values and they are:
+--------+----------+-------+
| Seller | Category | Price |
+--------+----------+-------+
| Joe | A | 1000 |
| Sue | B | 2000 |
+--------+----------+-------+
This would imply that Mike's sale happened over 10 hours ago and the value timed out.
At 3pm:
+--------+----------+-------+
| Seller | Category | Price |
+--------+----------+-------+
| Joe | A | 1000 |
| Joe | A | 1000 |
| Sue | B | 2000 |
+--------+----------+-------+
This implies that Joe made 1 sale of $1000 sometime in the past 10 hours, but has also made another sale at the same price since we last checked.
My Goal:
I'd like to be able to store each unique sale in the database once, but allow multiple sales when they do occur (I'm OK with allowing only 1 sale per day if the original plan is too complicated). I realize that without a timestamp, and with the potential for 25+ sales to push a value off the list early, the results aren't going to be 100% accurate, but I'd like to get at least an approximate idea of the sales occurring.
What I've done so far:
So far, I've made a table with the aforementioned columns as well as a DATETIME of when I insert the row and my own string version of the insert day (YYYYMMDD). I made the combination of Seller, Category, Price, and my YYYYMMDD date the primary key. I contemplated searching for entries less than 10 hours old prior to each insert, but I'm doing this operation on about 50k entries per hour, so I'm afraid that would be too much load for the system (I don't know, however; MySQL is not my forte). What I'm currently doing is allowing the recording of only 1 sale per day (enforced by the composite primary key above), but I discovered that a sale made at 10pm will get a duplicate added the next day at 1am, because the value hasn't timed out yet and is considered unique again once the date changes.
What would you do?
I'd love any ideas on how you'd go about achieving something like this. I'm open to all suggestions and I'm ok if the solution results in a seller only having 1 unique sale per day.
Thanks a lot, folks. I've been staring this problem down for a week now and I feel it's time to get some fresh eyes on it. Any comments are appreciated!
Update - While toying with the thought that I basically want to block entries for a given pseudo-PK (seller-category-price) for 10 hours at a time, it occurred to me: what if I had a two-stage insert process? Any time I get unique values, I put them in a stage-one table that stores the data plus an entry timestamp. If a duplicate tries to get inserted, I just ignore it. After 10 hours, I move those values from the stage-one table to my final values table, thus re-allowing entry of a duplicate sale after 10 hours. I think this would even allow multiple sales with overlapping times, with just a bit of delay. Say sales occurred at 1pm and 6pm: the 1pm entry would sit in the stage-one table until 11pm, and once it moved, the 6pm entry would be recorded, just 5 hours late (unfortunately the value would end up with an insert date that is 5 hours off too, which could push a sale to the next day, but I'm OK with that). This avoids the big issue I feared of querying the DB for duplicates on every insert. The only thing it complicates is live viewing of the data, but I think querying across the 2 tables shouldn't be too bad. What do you guys and gals think? See any flaws in this method?
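A minimal sketch of the idea, with all table and column names hypothetical:

-- Stage 1: holds each unique sale for the 10-hour window
CREATE TABLE sales_stage (
  seller   VARCHAR(50)  NOT NULL,
  category VARCHAR(50)  NOT NULL,
  price    DECIMAL(9,2) NOT NULL,
  seen_at  DATETIME     NOT NULL,
  PRIMARY KEY (seller, category, price)
);

-- Duplicates within the window are silently skipped
INSERT IGNORE INTO sales_stage (seller, category, price, seen_at)
VALUES ('Joe', 'A', 1000, NOW());

-- Periodically (e.g. hourly from cron): promote rows older than 10 hours
-- to the final table, then clear them so the same key can appear again.
INSERT INTO sales (seller, category, price, recorded_at)
SELECT seller, category, price, seen_at
FROM sales_stage
WHERE seen_at < NOW() - INTERVAL 10 HOUR;

DELETE FROM sales_stage
WHERE seen_at < NOW() - INTERVAL 10 HOUR;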
The problem is less about how to store the data than how to recognize which records are distinct in the first place (despite the fact there is no timestamp or transaction ID to distinguish them). If you can distinguish logically distinct records, then you can create a distinct synthetic ID or timestamp, or do whatever you prefer to store the data.
The approach I would recommend is to sample the URL frequently. If you can consistently harvest the data considerably faster than it is updated, you will be able to determine which records have been observed before by noting the sequence of records that precede them.
Assuming the fields in each record have some variability, it would be very improbable for the same sequence of 5, 10, or 15 records to recur within a 10-hour period. So as long as you sample the data quickly enough that only a fraction of the 25 records roll over each time, your conclusions will be very confident. This is similar to how DNA is sequenced in a "shotgun" algorithm.
You can determine how frequent the samples need to be by just taking samples and measuring how often you don't see enough prior records -- dial the sample frequency up or down.

MySQL database organization for stocks

So I am creating a web service to predict future stock prices based on historical data for each stock, and I need to store the following information in a database:
Stock information: company name, ticker symbol, predicted price
For each tracked stock: historical data including daily high, daily low, closing price, etc., for every day going back 1-5 years
User information: username, password, email, phone number, (the usual)
User tracked stocks: users can pick and choose stocks and later be alerted with predictions via email or phone
The set of stocks that predictions will be made on is not predefined, so there should be a quick way to add and remove stocks and, consequently, to add/remove all the data (as described above) connected to them. My approach to the design is the following:
Table: Stocks
+-----+-----------+----------+------------+----------+-------------+
| ID | Company | ticker | industry | Sector | Prediction |
+-----+-----------+----------+------------+----------+-------------+
Table: HistoricalPrices
+-------------------------------------+--------+--------+-------+----------+
| StockID(using stock ID from above) | Date | High | Low | Closing |
+-------------------------------------+--------+--------+-------+----------+
Table: Users
+-----+------------+------------+---------------+
| ID | Username | Password | PhoneNumber |
+-----+------------+------------+---------------+
Table: TrackedStock
+---------+----------+
| UserID | StockID |
+---------+----------+
Is there a better way to organize this? As far as queries are concerned, the majority will be done on the historical data, one stock at a time. (Please excuse any security issues, such as passwords needing to be salted and hashed; the purpose of the question is the organization.)
Simply said: no, though you may want to add a volume column to the historical prices.
What you may also want is a market table, and lookup tables for industry and sector, and possibly for the prediction - though the prediction should probably live in a separate table with... a date (so you can look back at past predictions).
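A minimal sketch of those suggestions (names and types are assumptions):

CREATE TABLE industry (
  industry_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  name        VARCHAR(60)  NOT NULL,
  PRIMARY KEY (industry_id)
) ENGINE=InnoDB;

CREATE TABLE sector (
  sector_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  name      VARCHAR(60)  NOT NULL,
  PRIMARY KEY (sector_id)
) ENGINE=InnoDB;

CREATE TABLE stocks (
  stock_id    INT UNSIGNED NOT NULL AUTO_INCREMENT,
  company     VARCHAR(100) NOT NULL,
  ticker      VARCHAR(10)  NOT NULL,
  industry_id INT UNSIGNED NOT NULL,
  sector_id   INT UNSIGNED NOT NULL,
  PRIMARY KEY (stock_id),
  UNIQUE KEY uq_ticker (ticker),
  FOREIGN KEY (industry_id) REFERENCES industry (industry_id),
  FOREIGN KEY (sector_id)   REFERENCES sector (sector_id)
) ENGINE=InnoDB;

-- Predictions as history rather than a single column on stocks
CREATE TABLE prediction (
  stock_id     INT UNSIGNED  NOT NULL,
  predicted_on DATE          NOT NULL,
  price        DECIMAL(12,4) NOT NULL,
  PRIMARY KEY (stock_id, predicted_on),
  FOREIGN KEY (stock_id) REFERENCES stocks (stock_id)
) ENGINE=InnoDB;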