Time Dimension in Data Warehouse - mysql

I have a fact table that stores multiple date fields in its rows. I would like to keep the design flexible and link all of these fields to the time dimension. However, the problem is that my reports end up having too many joins in their queries (one for each date field). How do I mitigate this problem?
I have one idea: store both the time dimension references (for fast searching) and the raw date fields (for efficient retrieval). What would be the possible problems in doing so?
Generalizing this idea, should we do it for other fields in the fact table as well?
The table structure
acc_num | acc_approved_date | acc_rejected_date | file_gen_date
Proposed changes while linking to the date dimension
acc_num | acc_approved_date_id | acc_rejected_date_id | file_gen_date_id
However, this creates the problem of too many joins to the date dimension table when building reports that capture all of these dates. I'm proposing a hybrid of the two, where I store both the dates and the ids for these fields.

You'd only have joins to the date dimension table if you wanted to find out something about the date (the name of the month and year, for example) or wanted to filter on the date.
Using multiple date keys is the correct way of doing it: for every dimension you want to filter by or include in your query results, you need a join.
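For illustration, here is a hedged sketch of such a report query, assuming the fact table is named account_fact and the dimension is date_dim with a date_id key and calendar attributes (all names are illustrative):

-- One join per date role; each alias plays a different "role" of the same dimension.
SELECT f.acc_num,
       ad.calendar_date AS acc_approved_date,
       rd.calendar_date AS acc_rejected_date,
       fd.calendar_date AS file_gen_date
FROM account_fact f
LEFT JOIN date_dim ad ON ad.date_id = f.acc_approved_date_id
LEFT JOIN date_dim rd ON rd.date_id = f.acc_rejected_date_id
LEFT JOIN date_dim fd ON fd.date_id = f.file_gen_date_id
WHERE ad.month_name = 'July';  -- filter on a dimension attribute only where needed

The joins are cheap lookups on the dimension's primary key, and they only become necessary when a report actually needs a dimension attribute.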

Related

How to use multiple tables without duplication in Tableau

I'm having trouble understanding how this should work... Basically, I have 2 main tables: one holds Revenues, the other Costs.
The Revenues table has the fields: P&L (string), Category (string), Products (string), Sold (int), invoiced (int), delivered (int), date (date).
The Costs table has: P&L (string), Category (string), Products (string), Costs (int), date (date).
I'd like to use the tables together to perform various calculations, like margin, at any level (total margin, meaning total revenues minus total costs; or at Category level, where I should be able to filter any category I have and perform the calculation; and so on).
The problem is that every attempt I've made to use relationships or joins has resulted in duplications.
The only workaround I've managed so far is to leave the Revenues table as it is and create many Costs tables, basically one per field (table1 with Category, Costs, and date; table2 with Products, Costs, and date; etc.). Joining Revenues with one of these tables seems to work, but this way I'm not able to create a wider view (one goal is to make a big table in the viz where we can read all the data at once). Another problem I've seen with this workaround: if I want to split costs by date but use the date column from the Revenues table, Tableau doesn't recognize the date correctly even when the dates are identical (I basically did a copy/paste between tables). So to split costs I have to use the Costs table's date column, and to split revenues the Revenues table's date column, which is frankly a pain...
So my question: how could I merge the 2 tables into one, or in any case put all the data together in a single working table to perform any kind of calculation, and how could I use just 1 date column that works for all the dates together?
I've uploaded a file here to make what I'm trying to combine clearer. Thank you, guys.
Data file
P.S.: It seems that Tableau uses SQL behind the scenes for these tasks, so someone skilled in this kind of problem in SQL could probably also help; for this reason I've tagged sql as well. Thanks.
You need to UNION those 2 tables together, but are they really in Google, or did you just do that to demo it here?
If you're using Excel, both Revenue & Cost must be different sheets in the same XLS file.
If you're using CSV, both Revenue & Cost must be different files (hopefully in the same folder).
I would really hope that you're using a database (some form of SQL), but with either of the above options, UNION the data and it will work the way you expect :)
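For example, a hedged sketch of that UNION in SQL, assuming tables named Revenues and Costs with the columns described above; each side pads the measures it lacks with NULL so the two halves line up, and a single date column serves both:

-- Stack the two tables; each row carries either revenue or cost measures.
SELECT `P&L`, Category, Products, Sold, invoiced, delivered, NULL AS Costs, `date`
FROM Revenues
UNION ALL
SELECT `P&L`, Category, Products, NULL, NULL, NULL, Costs, `date`
FROM Costs;

With the data stacked this way, margin at any level is simply the sum of the revenue measure minus SUM(Costs), and one date column drives every date split.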

How to store recent usage frequency in MySQL

I'm working on the Product Catalog module of an Invoicing application.
When the user creates a new invoice, the product name field should be an autocomplete field which shows the most recently used products from the product catalog.
How can I store this "usage recency/frequency" in the database?
I'm thinking about adding a new field, recency, which would be increased by 1 every time the product is used and decreased by 1/(count of all products) when another product is used. Then I'd use this recency field for ordering, but it doesn't seem like the best solution to me.
Can you tell me the best practice for this kind of problem?
Solution for the recency calculation:
Create a new column in the products table, named last_used_on for example. Its data type should be TIMESTAMP (MySQL's representation of Unix time). A minimal sketch of this approach follows the resources list below.
Advantages:
Timestamps contain both date and time parts.
They make very precise calculations and comparisons with regard to dates and times possible.
They let you format the saved values in the date-time format of your choice.
You can convert from any date-time format into a timestamp.
In regard to your autocomplete fields, they allow you to filter the products list as you wish: for example, to display all products used since [date-time], or to fetch all products used between [date-time-1] and [date-time-2], or to get the products used only on Mondays, at 1:37:12 PM, in the last two years, two months and three days (that's how flexible timestamps are).
Resources:
Unix-Time
The DATE, DATETIME, and TIMESTAMP Types
How should unix timestamps be stored in int columns?
How to convert human date to unix timestamp in Mysql?
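As a minimal sketch of this recency approach (the products table with id and name columns is assumed; the literal id and search prefix are placeholders):

-- Add the column once:
ALTER TABLE products ADD COLUMN last_used_on TIMESTAMP NULL;

-- Touch it whenever a product is used (id 42 is a placeholder):
UPDATE products SET last_used_on = NOW() WHERE id = 42;

-- Autocomplete: the ten most recently used products matching the typed prefix:
SELECT id, name
FROM products
WHERE name LIKE 'wid%'
ORDER BY last_used_on DESC
LIMIT 10;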
Solution for the usage rate calculation:
Well, actually, you are not speaking about a frequency calculation, but about a rate, even though one can argue that a frequency is a rate, too.
Frequency implies using time as the reference unit; it's measured in Hertz (Hz = 1/second). For example, you might query how many times a product was used in the last year.
A rate, on the other hand, is a comparison, a relation between two related quantities, like the exchange rate USD/EUR, where both terms are currencies. If the comparison takes place between two terms of the same type, then the result is a number without measurement units: a percentage. Like: 50 apples / 273 apples = 0.1832 = 18.32%.
That said, I suppose you are trying to calculate the usage rate: the number of usages of a product in relation to the number of usages of all products. For example, for one product: usage rate = 17 usages of the product / 112 total usages = 0.1517... = 15.17%. In the autocomplete you'd then display the products with a usage rate bigger than a given percentage (9%, for example).
This is easy to implement: in the products table, add a column usages of type INT or BIGINT and simply increment its value each time a product is used. Then, when you want to fetch the most used products, apply a filter as in this SQL statement:
SELECT
    id,
    name,
    (usages * 100) / (SELECT SUM(usages) FROM products) AS usage_rate
FROM products
GROUP BY id
HAVING usage_rate > 9
ORDER BY usage_rate DESC;
In the end, recency, frequency and rate are three different things.
Good luck.
To allow for future flexibility, I'd suggest the following additional (*) table to store the entire history of product usage by all users:
Name: product_usage
Columns:
id - internal surrogate auto-incrementing primary key
product_id (int) - foreign key to product identifier
user_id (int) - foreign key to user identifier
timestamp (datetime) - date/time the product was used
This would allow the query to be fine-tuned as necessary. E.g. you may decide to order only by past usage for the logged-in user. Or perhaps total usage within a particular timeframe would be more relevant. Such a table may also serve the dual purpose of auditing, e.g. to report on the most popular or unpopular products amongst all users.
(*) assuming something similar doesn't already exist in your database schema
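A hedged DDL sketch of that table (used_at stands in for the suggested timestamp column, since TIMESTAMP is also a MySQL keyword; index names and the example user id are illustrative):

CREATE TABLE product_usage (
    id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- surrogate key
    product_id INT UNSIGNED NOT NULL,  -- foreign key to products
    user_id    INT UNSIGNED NOT NULL,  -- foreign key to users
    used_at    DATETIME NOT NULL,      -- date/time the product was used
    KEY idx_product_time (product_id, used_at),
    KEY idx_user_time (user_id, used_at)
);

-- Most recently used products for one user (user id 7 is a placeholder):
SELECT product_id, MAX(used_at) AS last_used
FROM product_usage
WHERE user_id = 7
GROUP BY product_id
ORDER BY last_used DESC
LIMIT 10;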
Your problem is related to many other web-scale search applications, such as showing spell corrections, related searches, or "trending" topics. You recognized correctly that both recency and frequency are important criteria in determining "popular" suggestions. In practice, it is desirable to compromise between the two: recency alone will suffer from random fluctuations; but you also don't want to use only frequency, since some products might have been purchased a lot in the past but their popularity is declining (or they might have gone out of stock or been replaced by successor models).
A very simple but effective implementation that is typically used in these scenarios is exponential smoothing. First of all, most of the time it suffices to update popularities at fixed intervals (say, once each day). Set a decay parameter α (say, 0.95) that tells you how much yesterday's orders count compared to today's. Similarly, orders from two days ago will be worth α·α ≈ 0.9 times as much as today's, and so on. To estimate this parameter, note that the value decays to one half after log(0.5)/log(α) days (about 14 days for α = 0.95).
The implementation only requires a single additional field per product, orders_decayed. Then, all you have to do is to update this value each night with the total daily orders:
orders_decayed = α * orders_decayed + (1 - α) * orders_today.
You can sort your applicable suggestions according to this value.
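A sketch of that nightly update in SQL, assuming the products table carries the orders_decayed column; the order_items table and its columns are assumptions made for the example:

-- Nightly job: decay yesterday's score and blend in today's orders (α = 0.95).
UPDATE products p
LEFT JOIN (
    SELECT product_id, COUNT(*) AS orders_today
    FROM order_items
    WHERE ordered_at >= CURDATE()
    GROUP BY product_id
) t ON t.product_id = p.id
SET p.orders_decayed = 0.95 * p.orders_decayed + 0.05 * COALESCE(t.orders_today, 0);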
To have an individual user experience, you should not rely on a field in the product table, but rather on the history of the user.
The occurrences of the product in past invoices created by the user would be a good starting point. The advantage is that you don't need to add fields or tables for this functionality. You simply rely on data that is already present anyway.
Since it is an auto-complete field, maybe past usage is not really relevant. Display n search results as the user types. If you feel that results are better if you include recency in the calculation of the order, go with it.
Now, the implementation may differ depending on how and when products should be displayed, and on whether you need user-specific usage frequency or application-wide (overall) frequency. In both cases, though, I would suggest having a history table, which you can later use for other analysis as well.
You could design your history table with at least the columns below:
Id | ProductId | LastUsed (timestamp) | UserId
Now you can create a view which queries this table for a specific time range (product frequency over the last week, last month, or last year, for example) and gives you the most frequently used products for that range.
The same approach works for a user's specific frequency: just add a condition to filter by UserId.
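A hedged sketch of such views, using the History table and columns named above (the 30-day window is just an example):

-- Usage counts per product over the last 30 days:
CREATE VIEW product_frequency_30d AS
SELECT ProductId, COUNT(*) AS usages
FROM History
WHERE LastUsed >= NOW() - INTERVAL 30 DAY
GROUP BY ProductId;

-- User-specific variant: keep UserId in the view and filter on it when querying.
CREATE VIEW user_product_frequency_30d AS
SELECT UserId, ProductId, COUNT(*) AS usages
FROM History
WHERE LastUsed >= NOW() - INTERVAL 30 DAY
GROUP BY UserId, ProductId;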
I'm thinking about adding a new field recency which would be increased by 1 every time the product was used, and decreased by 1/(count of all products), when an other product is used. Then use this recency field for ordering, but it doesn't seem to me the best solution.
Yes, it is not good practice to add a column for this and update it every time. Imagine this product is highly in demand and people love to buy it. Now 1,000 people (or more) request it at the same time, and for every request you would update the same row. To maintain consistency under concurrency, the database has to lock that specific row and update it for each request, which is definitely going to hit your database and application performance. Instead, you can simply insert a new row.
The other possible solution is to use your existing invoice table, as it will already hold all the product- and user-specific information, and create a view over it to get the frequently used products, as I mentioned above.
Please note that this is another option to achieve what you are expecting, but I would personally recommend having a history table instead.
The scenario
When the user creates a new invoice the product name field should be an autocomplete field which shows the most recently used products from the product catalogue.
Your suggested solution
How can I store this "usage recency/frequency" in the database?
If it is a web application, don't store it in a database on your server; each user has different choices.
Store it in the user's browser as a cookie or in localStorage, as this will improve the user experience.
If you still want to store it in a MySQL table, do the following:
Create a column recency as said in the question.
Each time the item is used, increase the count by 1 as said in the question.
Don't decrease it when other items are used.
To get the item with the highest recency count, query:
SELECT * FROM products WHERE recency = (SELECT MAX(recency) FROM products);
Side note
Go for the database approach only if you want to show the most recently used products regardless of the user.
As you aren't certain which measure to choose, and it's rather a user-experience-related problem, I advise offering a number of measures and giving the user an option to choose the one he/she prefers. For example, the set of available measures could include: most popular product last week, last month, last 3 months, last year, and overall. For the sake of performance, I'd prefer to store those statistics in a separate table, refreshed by a scheduled job running every 3 hours, for example.

Linking Time Series Data to Records in a Relational Database

I've been thinking about this for a couple of days, but I feel that I'm lacking the right words to ask Google the questions I need answers to. That's why I'd really appreciate any kind of help, hints, or guidance.
First of all, I have almost no experience with databases (apart from misusing Excel as such) and, unfortunately, I have all my data written in very impractical and huge .csv files.
What I have:
I have time series data (in 15-minute steps) for several hundred sensors (SP) over the course of several years (a couple of million rows in total) in Table 1. There is also some weather condition data (WCD) that applies to all of my sensors and is therefore stored in the same table.
Note that each sensor delivers two data points per measurement.
Table1 (Sensors as Columns)
Now I also have another table (Table 2) that lists several static properties that define each sensor in Table 1.
Table 2 (Sensors as Rows)
My main question concerns database design and general implementation (MySQL or MS Access): is it really necessary to have hundreds of columns (two for each sensor) in Table 1? I wish I could store the "link" to the respective time series data simply as two additional columns in Table 2.
Is that feasible? Does that even make sense? How would I set up this database automatically (coming from .csv files with a different structure), since I can't type in every column by hand for hundreds of sensors and their attached time series?
In the end, I want to be able to query/sort my data (see below) by time frame, date, and sensor properties.
The reason for all of this is the following:
I want to create a third table (Table3) which “stores” dynamic values. These values are results of calculations based on the sensor-measurements and WCD in Table 1. However, depending on the sensor-properties in Table2, the sensors and their respective time series data that serve as input for the calculations of Table3 might differ from set to set.
That way I want to obtain, e.g., Set 1: "a portfolio of sensors with location A for each month between January 2010 and November 2011" and store it somewhere. Then I want to do the same for Set 2: e.g. "a portfolio of sensors with location B for the same time frame". Finally, I will compare these different portfolios and conduct further analysis on them. Does that sound reasonable at all?
So far, I'm not even sure whether I should actually store the results of each calculation for Table 3 in the database, or whether I should output them from a query and feed them directly into my analysis tool. What makes more sense?
A more useful structure for your sensor and WCD data would be:
Table SD - Sensor Data
Columns:
Datetime
Sensor
A_value
B_value
With this structure you do not need to store a link to the time series data in Table 2--the Sensor value is the common data that links the tables.
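A minimal sketch of that structure, with an example query that links through the shared sensor identifier; sensor_properties stands in for Table 2, and all names and types are assumptions:

CREATE TABLE sensor_data (
    measured_at DATETIME NOT NULL,
    sensor_id   INT NOT NULL,   -- the same identifier used in Table 2
    a_value     DOUBLE NULL,    -- first data point of the measurement
    b_value     DOUBLE NULL,    -- second data point
    PRIMARY KEY (sensor_id, measured_at)
);

-- Readings for all sensors at location A within a time frame:
SELECT d.measured_at, d.sensor_id, d.a_value, d.b_value
FROM sensor_data d
JOIN sensor_properties p ON p.sensor_id = d.sensor_id
WHERE p.location = 'A'
  AND d.measured_at BETWEEN '2010-01-01' AND '2011-11-30 23:59:59';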
If your weather conditions data all have the same type of values and/or attributes then you should normalize it similarly:
Table WCD - Weather Conditions Data, Normalized
Columns:
Datetime
Weather_condition
Weather_condition_value
From your example, it looks like different weather conditions may have different attributes (or different data types of attributes), in which case the form in which you have the WCD in your Table 1 may be most appropriate.
Storing the results of your calculations in another table sounds like a reasonable thing to do if at least some of your further analysis could be, or will be, done using SQL.

MySQL Database Design Questions

I am currently working on a web service that stores and displays currency data.
I have two MySQL tables, CurrencyTable and CurrencyValueTable.
The CurrencyTable holds the names of the currencies as well as their description and so forth, like so:
CREATE TABLE CurrencyTable ( name VARCHAR(20), description TEXT, .... );
The CurrencyValueTable holds the values of the currencies during the day - a new value is inserted every 2 minutes when the market is open. The table looks like this:
CREATE TABLE CurrencyValueTable ( currency_name VARCHAR(20), value FLOAT, `datetime` DATETIME, ....);
I have two questions regarding this design:
1) I have more than 200 currencies. Is it better to have a separate CurrencyValueTable for each currency or hold them all in one table?
2) I need to be able to show the current (latest) value of the currency. Is it better to just insert such a field to the CurrencyTable and update it every two minutes or is it better to use a statement like:
SELECT value FROM CurrencyValueTable ORDER BY `datetime` DESC LIMIT 1
The second option seems slower.. I am leaning towards the first one (which is also easier to implement).
Any input would be greatly appreciated!!
p.s. - please ignore SQL syntax / other errors, I typed it off the top of my head..
Thanks!
To your questions:
I would use one table. Especially if you need to report on or compare data from multiple currencies, your queries will be vastly simpler if everything is in one table.
If you don't have a need to track the history of each currency's value, then go ahead and just update a single value -- but in that case, why even have a separate table? You can just add "latest value" as a field in the currency table and update it there. If you do need to track history, then you will need the two tables and the SQL you posted will work.
As an aside, instead of FLOAT I would use DECIMAL(10,2). As of MySQL 5.0, DECIMAL values are stored and computed exactly, which gives the predictable rounding you want when handling currency.
It is better to have one table holding all currencies.
If there is a need for historical prices, then the table needs to hold them. A reasonable compromise in many situations is to split the price data into a full list of historical prices and another table which only holds the current prices.
Using the data type FLOAT can be troublesome for money values; please be sure you know what you are doing. If not, use an exact numeric type such as DECIMAL instead.
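A hedged sketch of that compromise, using DECIMAL as suggested above (all names and literals are illustrative):

-- Full history: one row per currency per two-minute tick.
CREATE TABLE currency_value_history (
    currency_name VARCHAR(20) NOT NULL,
    value         DECIMAL(10,2) NOT NULL,
    quoted_at     DATETIME NOT NULL,
    PRIMARY KEY (currency_name, quoted_at)
);

-- Current prices: one row per currency, overwritten on every tick.
CREATE TABLE currency_value_current (
    currency_name VARCHAR(20) NOT NULL PRIMARY KEY,
    value         DECIMAL(10,2) NOT NULL,
    quoted_at     DATETIME NOT NULL
);

-- Upsert the latest tick into the current-price table (values are placeholders):
INSERT INTO currency_value_current (currency_name, value, quoted_at)
VALUES ('USD', 132.45, NOW())
ON DUPLICATE KEY UPDATE value = VALUES(value), quoted_at = VALUES(quoted_at);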
As your web service is transactional, it is better if you have to access fewer tables at the same time. Since you will be reading and writing a lot, I would suggest having a single table.
It's better to add a field to the CurrencyTable and update it rather than hitting two tables for a single request.

Critique my MySQL Database Design for Unlimited DYNAMIC Fields

Looking for a scalable, flexible, and fast database design for a 'build your own form' style website, e.g. Wufoo.
Rules:
User has only 1 Form they can build
User can create their own fields or choose from 'standard' fields
User's 1 Form has as many fields as the user wants
Values can be siblings of other values. E.g. a photo value could have name, location, width, and height as sibling values.
Special Rules:
User can submit their form a maximum of 5 times a day
Value's Date is important
Flexibility to report on values (for single user, across all users, 1 field, many fields) is very important -- data visualization (most will be chronologically based e.g. all photos for July 2009 for all users).
Table "users"
uid
Table "field_user" - assign a field to a users form
fid
uid
weight - int - used to order the fields on the users form
Table "fields"
fid
creator_uid - int - the field 'creator'
label - varchar - e.g. Email
value_type - varchar - used to determine what field in the 'values' table will be filled in (e.g. if 'int' then values of this field will submit data into the values.type_int field - and all other .type_x fields will be NULL).
field_type - varchar - e.g. 'email' - used for special conditions e.g. validation rules
Table "values"
vid
parent_vid
fid
uid
date - date
date_group - int - value 1-5 (user may submit max of 5 forms per day)
type_varchar - varchar
type_text - text
type_int - int
type_float - float
type_bool - bool
type_date - date
type_timestamp - timestamp
I understand that this approach means records in the values table will only have 1 piece of data, with the other type_x fields containing NULLs... but from my understanding this design will be the 'fastest' solution (fewer queries, fewer join tables).
At OSCON yesterday, Josh Berkus gave a good tutorial on DB design, and he spent a good fraction of it mercilessly tearing into such "EAV"il tables; you should be able to find his slides on the OSCON site soon, and eventually the audio recording of his whole tutorial online (the latter will probably take a while).
You'll need a join per attribute (multiple instances of the values table, one per attribute you're fetching or updating), so I don't know what you mean by "fewer join tables". Joining many instances of the same table isn't a particularly fast operation, and your design makes indexes nearly infeasible and unusable.
At least as a minor improvement, use separate per-type tables for your attributes' values (maybe some indexing might be applicable in that case, though with MySQL's limitation of one index per table per query, even that's somewhat dubious).
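To make the join-per-attribute cost concrete, here is a hedged sketch of fetching just two fields of one user's submission from the values table above (the fid constants and field meanings are assumptions, and date_group is ignored for brevity):

-- One self-join of `values` per attribute fetched; ten fields means ten joins.
SELECT email_v.type_varchar AS email,
       width_v.type_int AS photo_width
FROM users u
JOIN `values` email_v ON email_v.uid = u.uid AND email_v.fid = 1  -- the 'Email' field
JOIN `values` width_v ON width_v.uid = u.uid AND width_v.fid = 2  -- the 'Width' field
WHERE u.uid = 42;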
You should really look into schema-free DBs like CouchDB; problems like this are exactly the kind these types of DBs are designed to solve.
Y'know, CREATE TABLE, ALTER TABLE ... ADD COLUMN, etc. are operations you can perform at run time in many modern RDBMS implementations. Why be EAVil? Especially if you are using dynamic SQL.
It's not for the fainthearted. I recall an implementation at Boeing which resulted in 70,000 tables in a database.
Obviously there are pitfalls in dynamic table creation, but they also exist for EAV tables. Things like two attributes for the same key expressing the same fact. Or transitive dependencies and other normalization gotchas. So why not at least leverage the power of the RDBMS on your behalf?
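A hedged sketch of that dynamic-DDL alternative, where each user's form becomes a real table altered at run time via dynamic SQL (all names are illustrative):

-- Create the user's form table when they first build their form:
CREATE TABLE form_user_42 (
    submission_id INT AUTO_INCREMENT PRIMARY KEY,
    submitted_at  DATETIME NOT NULL
);

-- Each field the user defines becomes a real, typed, indexable column:
ALTER TABLE form_user_42 ADD COLUMN email VARCHAR(255) NULL;
ALTER TABLE form_user_42 ADD COLUMN photo_width INT NULL;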
I agree with john owen.
Dynamically creating a query from the schema is a small price to pay compared to querying EAV tables, especially if the tables are large.
Usually table columns are considered an "interface". A design that relies on a dynamically changing interface is usually bad, but EAV data is a special case where you don't have many options: you have to choose between slow, unintuitive queries and a dynamic schema.