It appears to be a common practice to let the time dimension of OLAP cubes be in a table of its own, like the other dimensions.
My question is: why?
I simply don't see what the advantage would be to have a time_dimension table of (int, timestamp) that is joined with the cube on some time_id foreign key, instead of having a timestamp column in the cube itself.
In principle, points in time are immutable constants; they are their own value. I find it very unlikely that one would ever want to change the timestamp associated with a given time_id.
In addition, the timestamp column type is 4 bytes wide (in MySQL), as is the int type that would typically be used as the key, so it cannot be done to save space either.
In discussing this with my colleagues, the only somewhat sensible argument I have been able to come up with is conformity with the other dimensions. But I find this argument rather weak.
I believe that it's often because the time dimension table contains a number of columns such as week/month/year/quarter, which allows for faster queries to get all of X for a particular quarter.
Given that the majority of OLAP cubes are written to get queries over time, this makes sense to me.
Paddy is correct: the time dimension contains useful "aliases" for the time primitives. You can capture useful information about the dates themselves, such as quarter, national holiday, etc. You can write much faster queries this way because there's no need to hard-code every holiday in your query.
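To make that concrete, here is a minimal sketch of what such a date dimension might look like. All table and column names below are illustrative, not taken from the question:

-- Hypothetical date dimension: one row per calendar date, with pre-computed
-- attributes that would otherwise have to be derived in every query.
CREATE TABLE dim_date (
    date_id        INT        NOT NULL PRIMARY KEY,  -- e.g. 20131129
    full_date      DATE       NOT NULL,
    year_number    SMALLINT   NOT NULL,
    quarter_number TINYINT    NOT NULL,
    month_number   TINYINT    NOT NULL,
    week_of_year   TINYINT    NOT NULL,
    day_name       VARCHAR(9) NOT NULL,
    is_weekend     BOOLEAN    NOT NULL,
    is_holiday     BOOLEAN    NOT NULL
);

-- "All of X for a particular quarter" then becomes a plain join and filter:
SELECT SUM(f.amount) AS total
FROM   fact_sales f
JOIN   dim_date d ON d.date_id = f.date_id
WHERE  d.year_number = 2013
AND    d.quarter_number = 4;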
I need help regarding how to structure overlapping date ranges in my data warehouse. My objective is to model the data in a way that allows date-level filtering on the reports.
I have dimensions — DimEmployee, DimDate and a fact called FactAttendance. The records in this fact are stored as follows —
To represent this graphically —
A report needs to be created from this data that will allow the end user to filter it by selecting a date range. Let's assume the user selects the date range D1 to D20. On making this selection, the user should see how many days at least one of the employees was on leave. In this particular example, that is the sum of the light-blue segments at the bottom, i.e. 11 days.
An approach that I am considering is to store one row per employee per date for each of the leaves. The only problem with this approach is that it will greatly multiply the number of records in the fact table. Besides, there are other columns in the fact that will carry redundant data.
How are such overlapping date/time problems usually handled in a warehouse? Is there a better way that does not involve inserting numerous rows?
Consider modelling your fact like this:
fact_attendance (date_id, employee_id, hours, ...)
This will enable you to answer your original question by simply filtering on the Date dimension, but you will also be able to handle issues like leave credits and fractional-day leave usage.
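For example, with one row per employee per date, "how many days was at least one employee on leave in the selected range" becomes a filtered count. This is only a sketch; the dim_date names are assumptions, and it assumes leave is recorded as rows with non-zero hours:

-- Days in the selected range (D1 to D20) on which at least one employee
-- was on leave.
SELECT COUNT(DISTINCT f.date_id) AS days_with_leave
FROM   fact_attendance f
JOIN   dim_date d ON d.date_id = f.date_id
WHERE  d.full_date BETWEEN '2013-11-01' AND '2013-11-20'   -- D1 .. D20
AND    f.hours > 0;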
Yes, it might use a little more storage than your first proposal, but it is a better dimensional representation, and will satisfy more (potential) requirements.
If you are really worried about storage - probably not a real worry - use a DBMS with columnar compression, and you'll see large savings in disk.
The reason I say "not a real worry" about storage is that your savings are meaningless in today's world of storage. 1,000 employees with 20 days leave each per year, over five years would mean a total of 100,000 rows. Your DBMS would probably execute the entire star join in RAM. Even one million employees would require less than one terabyte before compression.
I am trying to take information from one MySQL table, perform a bunch of calculations on this data, and then put the results in a second MySQL table. What would be the best way of doing this (i.e. in MySQL itself, using python, etc.)?
My apologies for the vagueness; I'll try to be more specific. Table 1 has every meal that every person in my class eats, so each meal is a primary key, and other columns include the person and the number of calories. The primary key for Table 2 is the person, and another column is the percentage of total calories this person has eaten out of the calories of the entire class. Another column is the percentage of the total calories of this person's gender in the class. Every day, I want to take the new eating information and use it to update the percentages in Table 2. (Thanks for the help!)
Assuming the calculations can be done in SQL (and percentages are definitely doable), you have some choices.
The first, and academically correct, choice is not to store this in a table at all. One of the principles of normalization is that you don't store duplicate or calculated values - instead, you calculate them as you need them.
This isn't just an academic concern - it avoids many silly bugs and anomalies, and it means your data is always up to date - you don't have to wait for your calculation query to run before you can use the data.
If the calculation is non-trivial and/or an essential part of the business domain, common practice is to create a database view, which behaves like a table when queried, but is actually calculated on the fly. This means that the business logic is encapsulated in the view, rather than repeated in multiple queries. You can go further, with materialized views etc. - but the basic principle is the same.
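For the calorie example, the view might look something like the sketch below. The table and column names (meals, people, calories, gender) are guesses based on the description above, not the actual schema:

-- Percentage of class calories and of same-gender calories per person,
-- calculated on the fly instead of being stored.
CREATE VIEW person_calorie_share AS
SELECT p.person_id,
       100 * SUM(m.calories) / (SELECT SUM(calories) FROM meals)
           AS pct_of_class_calories,
       100 * SUM(m.calories) /
           (SELECT SUM(m2.calories)
              FROM meals m2
              JOIN people p2 ON p2.person_id = m2.person_id
             WHERE p2.gender = p.gender)
           AS pct_of_gender_calories
FROM   meals m
JOIN   people p ON p.person_id = m.person_id
GROUP BY p.person_id, p.gender;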
In some cases, where the volume of data is huge, or the calculations are time consuming, or you have calculations that are very hard to embed in a single SQL statement, it's common to create "aggregate tables" - this is what you are suggesting. You can populate these tables either by (scheduled) queries, or by using database triggers.
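If you do end up with an aggregate table, the scheduled refresh can be as simple as reloading it from a query (or from a view like the one sketched above); the names here are again assumptions:

-- Nightly refresh of a hypothetical aggregate table.
TRUNCATE TABLE person_calorie_share_agg;

INSERT INTO person_calorie_share_agg
    (person_id, pct_of_class_calories, pct_of_gender_calories)
SELECT person_id, pct_of_class_calories, pct_of_gender_calories
FROM   person_calorie_share;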
However, aggregate tables are a last resort: they make the solution much harder to maintain and debug. If the data is wrong, you don't have a single query to debug; you've got to follow the chain of logic all the way through.
Assuming you are in a class of a few dozen people and are reporting on fewer than 10 million meals, any modern RDBMS can calculate this report in milliseconds - there's really no need to store it in an aggregate table.
A possible solution is to create a View or a Materialized View with the complex SELECT query behind it.
A Materialized View could be another option, as you wrote that you would like to have these results re-queried/refreshed every day.
If you need to do more advanced operations on those tables, you could create a stored procedure and call it when you need its data.
Note: you can't do further work with a procedure's result set (e.g. you can't join to it from a SELECT) unless you first put it into something like a temporary table.
We are refactoring our database; we will be adding about 100,000 rows to the db a day.
Each row now contains a date field (yy-mm-dd), so today we entered the date "2013-11-29" 100,000 times into a table.
Would it make sense to break out the dates into a separate table and use its id instead, since we don't store the time?
There is a trade-off there. If we break it out, we will be adding yet another JOIN to the query used when we want to view the records later. We already join 3 tables when we query for information, and the database consists of about 10 million entries.
Any thoughts? The database is growing huge, so we need to think about disk space but also performance.
The driving consideration for breaking out the dates into a separate table should not be storage. This is a question about the data model. As Gareth points out, the size of the built-in date and the size of the foreign key are about the same.
The real question is whether you need additional information about dates that you cannot readily get from the built-in date functions. MySQL has a rich set of date functions, where you can get the day of the week, format the dates into strings, extract components, and so on. In addition, the built-in functions also handle comparisons, differences, and adding "intervals" to dates.
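A few of those built-in MySQL functions, for illustration (the orders table and order_date column are made up):

SELECT DAYOFWEEK(order_date)             AS day_of_week,    -- day of week (1 = Sunday)
       DATE_FORMAT(order_date, '%Y-%m')  AS yr_month,       -- format as a string
       QUARTER(order_date)               AS qtr,            -- extract a component
       DATEDIFF(CURDATE(), order_date)   AS age_in_days,    -- difference in days
       order_date + INTERVAL 7 DAY       AS one_week_later  -- add an interval
FROM   orders;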
If these are sufficient, then use the built-in functions. On the other hand, if you have "business rules" surrounding dates, then consider another table. Here are examples of such business rules:
A special set of dates that might be holidays. Or worse, country-dependent holidays.
Special business definitions for "quarter", "year", "week-of-year" and so on.
The need to support multiple date formats for internationalization purposes.
Special rules about date types, such as "workday" versus "weekend".
A date table is only one possible solution for handling these issues. But if you need to support these, then having a separate date table starts to make sense.
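As a hypothetical example of such a business rule in action (the flag columns and table names are assumptions, not anything built into MySQL):

-- Count orders shipped on working days only, excluding US holidays.
SELECT COUNT(*)
FROM   orders o
JOIN   dim_date d ON d.full_date = o.ship_date
WHERE  d.is_workday = 1
AND    d.is_us_holiday = 0;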
It probably doesn't make any sense to do this since the storage requirements increase just a bit, and if you ever want to start storing the time, you're sort of stuck.
I'm creating a database to store a lot of events. There will be a lot of them and they will each have an associated time that is precise to the second. As an example, something like this:
Event
-----
Timestamp
ActionType (FK)
Source (FK)
Target (FK)
Actions, Sources, and Targets are all in 6NF. I'd like to keep the Event table normalized, but all of the approaches I could think of have problems. To be clear about my expectations for the data, the vast majority (99.9%) of events will be unique with just the above four fields (so I can use the whole row as a PK), but the few exceptions can't be ignored.
Use a Surrogate Key: If I use a four-byte integer this is possible, but it seems like just inflating the table for no reason. Additionally I'm concerned about using the database over a long period of time and exhausting the key space.
Add a Count Column to Event: Since I expect small counts I could use a smaller datatype, and this would have a smaller effect on database size, but it would require upserts or pooling the data outside the database before insertion. Either of those would add complexity and influence my choice of database software (I was thinking of going with Postgres, which does upserts, but not gladly). A possible upsert is sketched after these options.
Break Events into small groups: For example, all events in the same second could be part of a Bundle which could have a surrogate key for the group and another for each event inside it. This adds another layer of abstraction and size to the database. It would be a good idea if otherwise-duplicate events become common, but otherwise seems like overkill.
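For the count-column option, a minimal sketch of the upsert in PostgreSQL (ON CONFLICT requires 9.5 or later; the column names and the unique constraint are assumptions):

-- Assumes a unique constraint over (ts, action_type_id, source_id, target_id).
INSERT INTO event (ts, action_type_id, source_id, target_id, event_count)
VALUES ('2013-02-02 08:00:01', 1, 2, 3, 1)
ON CONFLICT (ts, action_type_id, source_id, target_id)
DO UPDATE SET event_count = event.event_count + 1;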
While all of these are doable, they feel like a poor fit for my data. I was thinking of just doing a typical Snowflake and not enforcing a uniqueness constraint on the main Event table, but after reading PerformanceDBA answers like this one I thought maybe there was a better way.
So, what is the right way to keep time-series data with a small number of repeated events normalized?
Edit: Clarification - the sources for the data are logs, mostly flat files but some in various databases. One goal of this database is to unify them. None of the sources have time resolution more precise than to the second. The data will be used for questions like "How many different Sources executed Action on Target over Interval?" where Interval will not be less than an hour.
The simplest answers seem to be
store the timestamp with greater precision, or
store the timestamp to the second and retry (with a slightly later timestamp) if INSERT fails because of a duplicate key.
None of the three ideas you mention have anything to do with normalization. These are decisions about what to store; at the conceptual level, you normalize after you decide what to store. What the row means (so, what each column means) is significant; these meanings make up the table's predicate. The predicate lets you derive new true facts from older true facts.
Using an integer as a surrogate key, you're unlikely to exhaust the key space. But you still have to declare the natural key, so a surrogate in this case doesn't do anything useful for you.
Adding a "count" colummn makes sense if it makes sense to count things; otherwise it doesn't. Look at these two examples.
Timestamp            ActionType  Source  Target
-----------------------------------------------
2013-02-02 08:00:01  Wibble      SysA    SysB
2013-02-02 08:00:02  Wibble      SysA    SysB

Timestamp            ActionType  Source  Target  Count
------------------------------------------------------
2013-02-02 08:00:01  Wibble      SysA    SysB    2
What's the difference in meaning here? The meaning of "Timestamp" is particularly important. Normalization is based on semantics; what you need to do depends on what the data means, not on what the columns are named.
Breaking events into small groups might make sense (like adding a "count" column might make sense) if groups of events have meaning in your system.
I have a SQL Server 2008 SSIS/SSAS data warehouse cube that I am publishing; in this cube I have the following:
Dimensions
----------
Company
Product
Sales Person
Shipped Date (time dimension)
Facts
-----
Total Income
Total Revenue
Gross
For the above, I have setup primary (PK) / surrogate (SK) keys for the dimension/fact data referencing.
What I would also like to include are things such as Order Number or Transaction Number, which in my mind would fit in a fact table, as the order number is different for every record. Creating an order number dimension does not make much sense, as I would have as many order numbers as I have facts.
Right now, when I load my fact data I do multiple Lookups on the dimensions to get the surrogate keys. I then pass in the fact data and also include these Order Number and Transaction Number varchar columns, but they cannot be used, as they are not something you can aggregate on, so they don't show up in SSAS; only columns of numeric data type do for the fact table (total income, total revenue, etc.).
Is there something I can do to make these available for anyone using the Cube to filter on?
Invoice number is a perfect candidate for a degenerate dimension.
It can be included in your fact table, yet not be linked to any dimension. These sorts of numbers are not useful in analytics except when you want to drill down and investigate and need to trace back a record to your source system, and they don't have any sensible "dimensionality". Kimball calls them degenerate dimensions. In SSAS they are called "fact dimensions"
http://msdn.microsoft.com/en-us/library/ms175669(v=sql.90).aspx
You are essentially putting an attribute column into the fact table, rather than a dimension table.
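As a rough sketch of what that fact table might look like (all names below are illustrative, not taken from your cube):

CREATE TABLE fact_sales (
    company_sk      INT           NOT NULL,  -- surrogate key -> dim_company
    product_sk      INT           NOT NULL,  -- surrogate key -> dim_product
    sales_person_sk INT           NOT NULL,  -- surrogate key -> dim_sales_person
    shipped_date_sk INT           NOT NULL,  -- surrogate key -> dim_date
    order_number    VARCHAR(20)   NOT NULL,  -- degenerate dimension: no dim table
    total_income    DECIMAL(18,2) NOT NULL,
    total_revenue   DECIMAL(18,2) NOT NULL,
    gross           DECIMAL(18,2) NOT NULL
);

In SSAS you would then expose order_number as a fact dimension, so cube users can filter and drill through on it.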
One important tip: in dimensional modelling, yes, you are trying to build a star schema with perfectly formed dimensions, but don't be afraid to ignore the ideal when it comes to practical implementation. Kimball even says this: sometimes you need to break the rules, with the caveat that you test your solution. If it's quick, then do it! If conforming to the Kimball ideal makes it slower or adds unnecessary complexity, avoid it.