How to model attribute units in a database design? - mysql

I need to design a database table where most attributes have units. For example:
Readings
--------
id   load (kW)   fuel_consumption (tonnes)   etc.
1    1154        89.4
2    1199        54.2
What's the recommended way to capture the units in the design? For example, I could:
Store units within attribute names, e.g. load_kW and fuel_consumption_tonnes.
Store units in a separate table, e.g. each value becomes a foreign key to another table with columns for value and unit.
Store units outside the database, e.g. in business logic or in documentation.
Are there others?
I happen to be using MySQL, but I assume this is a generic database normalisation problem.
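For concreteness, here is roughly what the first two options might look like as DDL (table and column names are only illustrative):

-- Option 1: units baked into the column names
CREATE TABLE readings_v1 (
    id                      INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    load_kW                 DECIMAL(10,2) NOT NULL,
    fuel_consumption_tonnes DECIMAL(10,2) NOT NULL
);

-- Option 2: each measurement carries a unit code from a lookup table
CREATE TABLE unit (
    unit_id   VARCHAR(8)  PRIMARY KEY,   -- e.g. 'kW', 'kg', 't'
    unit_name VARCHAR(50) NOT NULL
);

CREATE TABLE readings_v2 (
    id                    INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    load_value            DECIMAL(10,2) NOT NULL,
    load_unit             VARCHAR(8)    NOT NULL,
    fuel_consumption      DECIMAL(10,2) NOT NULL,
    fuel_consumption_unit VARCHAR(8)    NOT NULL,
    FOREIGN KEY (load_unit)             REFERENCES unit (unit_id),
    FOREIGN KEY (fuel_consumption_unit) REFERENCES unit (unit_id)
);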

Interesting question...
There are two obvious routes:
id   load_kW   fuel_consumption_tonnes
---------------------------------------
1    1154      89.4
2    1199      54.2
This is easy for humans to read, and fairly logical. However, if some readings are in kilos and others in tonnes, you have to convert those readings to fit into the "readings" table; this process MUST be lossless and idempotent. For instance, a reading of "89403 kilos" is not "89.4 tonnes", even though the business may choose to round from kilos to tonnes for convenience. There are usually some counter-intuitive rounding issues along the way...
If that's the case, you could change the schema:
id   load   load_unit   fuel_consumption   fuel_consumption_unit
-----------------------------------------------------------------
1    1154   kW          89403              kg
2    1199   kW          54.2               t
With a "unit" table, if you need it:
unit_id   unit_name
--------------------
kg        kilogramme
t         tonne
However, this model is open to human failure - it would be easy to change the "load_unit" column without modifying the "load" column, thus breaking the data. There's nothing you can really do to your data model to avoid this. It also makes common queries fairly tricky: imagine trying to retrieve the total of "load" in a consistent unit of measurement.
I would recommend that in this case, you have two tables: "raw_readings", with the original data in the format above, and "normalized_readings", which you populate by converting all the readings to a consistent unit of measurement.
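A minimal sketch of that two-table split, with illustrative names and types (not the only way to do it):

-- Raw readings, stored exactly as received, with their original units
CREATE TABLE raw_readings (
    id                    INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    load_value            DECIMAL(12,3) NOT NULL,
    load_unit             VARCHAR(8)    NOT NULL,   -- e.g. 'kW'
    fuel_consumption      DECIMAL(12,3) NOT NULL,
    fuel_consumption_unit VARCHAR(8)    NOT NULL    -- e.g. 'kg' or 't'
);

-- Normalized readings, converted once to one agreed unit per column
CREATE TABLE normalized_readings (
    raw_id              INT UNSIGNED  PRIMARY KEY,  -- one row per raw reading
    load_kw             DECIMAL(12,3) NOT NULL,
    fuel_consumption_kg DECIMAL(12,3) NOT NULL,
    FOREIGN KEY (raw_id) REFERENCES raw_readings (id)
);

Totals and comparisons then run against normalized_readings only, while the lossless originals stay available in raw_readings if a conversion rule ever changes.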

It depends ultimately on what you intend or need to do with your quantities.
If (in the unlikely case) all you will ever do is record the values for later regurgitation, then it doesn't really matter what you do with units, since the scalar values have no semantic significance to your model.
It is much more likely that the scalars in your system have some importance, for example because you are performing calculations on them. In such a case your units matter very much.
The next question you need to answer for yourself is whether the units will always be consistent and must not be allowed to be changed. In most cases I would say that this is a risky conclusion. It could be a business rule that you impose through your system, but business rules have a nasty habit of changing.
For this reason I would recommend storing a unit of measure with every scalar that represents an actual measurement. Being explicit in this way takes a bit of disk space, but it gives you clarity and flexibility.
Something that I have done in the past is to extend the unit of measure model to include UOM types, like length, temperature, volume, time, etc. Keeping a table that maps each UOM to a UOM Type allows you to also store conversion factors. That way, if someone should come to you with a reading in BHP and pounds you would know what to do with it and how to compare it to your typical entries in kW and tonnes.
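A rough sketch of that extended model, with made-up names, base units, and factors:

CREATE TABLE uom_type (
    uom_type_id VARCHAR(16) PRIMARY KEY,   -- 'mass', 'power', 'volume', ...
    base_uom    VARCHAR(8)  NOT NULL       -- e.g. 'kg' for mass, 'kW' for power
);

CREATE TABLE uom (
    uom_id         VARCHAR(8)    PRIMARY KEY,  -- 'kg', 't', 'kW', 'BHP', ...
    uom_type_id    VARCHAR(16)   NOT NULL,
    to_base_factor DECIMAL(18,9) NOT NULL,     -- multiply to get the type's base unit
    FOREIGN KEY (uom_type_id) REFERENCES uom_type (uom_type_id)
);

INSERT INTO uom_type VALUES ('mass', 'kg'), ('power', 'kW');
INSERT INTO uom VALUES ('kg',  'mass',  1.0),
                       ('t',   'mass',  1000.0),
                       ('kW',  'power', 1.0),
                       ('BHP', 'power', 0.745699872);

-- Convert a stored reading to its base unit on the fly
-- (assumes a table with load_value and load_unit columns, as sketched earlier)
SELECT r.id, r.load_value * u.to_base_factor AS load_kw
FROM   readings r
JOIN   uom u ON u.uom_id = r.load_unit;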

Related

MySQL normalizing data not always present

I'm storing the information for Magic the Gathering cards in a database, and my question is how to store information that is not always present.
E.g., creatures have power/toughness, while other cards don't. Should I store it for every card and leave it empty for the ones that don't have it? Or create another table to store this information, like:
cardID   power   resist
How would this affect query times?
From a normalization point of view, power and resistance are functionally dependent on the cardID, and should thus be in the same table as all the other attributes that depend on the cardID.
If I correctly remember how Magic works, you would have one big table with several columns for all the features of a card: power, resistance, cost (which I would model as a VARCHAR), picture.
Then, since a 1-N (or M-N, I don't remember) relationship exists between effects and cards, you will have an extra table for the effects, which uses the cardID as a foreign key (or a junction table, if it's M-N).
Anyway, I count only 4 fields + ID for the cards, which isn't much to worry about; it's quite little, really. Even if you wanted to model the cost as multiple INTs, the number of elements is limited (5-6?). A table with 10, or even 20, columns is not a problem for a DBMS.
You're more likely to have a harder time with the effects (if you're planning to implement them).
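A possible shape for that design, assuming the card/effect relationship is M-N (all names are illustrative, and creature stats are simply left NULL for cards that don't have them):

CREATE TABLE card (
    card_id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name      VARCHAR(100) NOT NULL,
    cost      VARCHAR(20),          -- kept as text, e.g. '2WU'
    power     INT NULL,             -- NULL for non-creature cards
    toughness INT NULL,
    picture   VARCHAR(255)
);

CREATE TABLE effect (
    effect_id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    description VARCHAR(255) NOT NULL
);

-- Junction table for the M-N case
CREATE TABLE card_effect (
    card_id   INT UNSIGNED NOT NULL,
    effect_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (card_id, effect_id),
    FOREIGN KEY (card_id)   REFERENCES card (card_id),
    FOREIGN KEY (effect_id) REFERENCES effect (effect_id)
);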

Normalizing Time Series Data

I'm creating a database to store a lot of events. There will be a lot of them and they will each have an associated time that is precise to the second. As an example, something like this:
Event
-----
Timestamp
ActionType (FK)
Source (FK)
Target (FK)
Actions, Sources, and Targets are all in 6NF. I'd like to keep the Event table normalized, but all of the approaches I could think of have problems. To be clear about my expectations for the data, the vast majority (99.9%) of events will be unique with just the above four fields (so I can use the whole row as a PK), but the few exceptions can't be ignored.
Use a Surrogate Key: If I use a four-byte integer this is possible, but it seems like just inflating the table for no reason. Additionally I'm concerned about using the database over a long period of time and exhausting the key space.
Add a Count Column to Event: Since I expect small counts I could use a smaller datatype and this would have a smaller effect on database size, but it would require upserts or pooling the data outside the database before insertion. Either of those would add complexity and influence my choice of database software (I was thinking of going with Postgres, which does upserts, but not gladly.)
Break Events into small groups: For example, all events in the same second could be part of a Bundle which could have a surrogate key for the group and another for each event inside it. This adds another layer of abstraction and size to the database. It would be a good idea if otherwise-duplicate events become common, but otherwise seems like overkill.
While all of these are doable, they feel like a poor fit for my data. I was thinking of just doing a typical Snowflake and not enforcing a uniqueness constraint on the main Event table, but after reading PerformanceDBA answers like this one I thought maybe there was a better way.
So, what is the right way to keep time-series data with a small number of repeated events normalized?
Edit: Clarification - the sources for the data are logs, mostly flat files but some in various databases. One goal of this database is to unify them. None of the sources have time resolution more precise than to the second. The data will be used for questions like "How many different Sources executed Action on Target over Interval?" where Interval will not be less than an hour.
The simplest answers seem to be
store the timestamp with greater precision, or
store the timestamp to the second and retry (with a slightly later timestamp) if INSERT fails because of a duplicate key.
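For example, the retry idea could be as simple as this (illustrative, MySQL-flavoured; the application loops, adding a second, until the INSERT succeeds):

-- The whole row is the natural key
CREATE TABLE event (
    ts          DATETIME NOT NULL,
    action_type INT      NOT NULL,
    source      INT      NOT NULL,
    target      INT      NOT NULL,
    PRIMARY KEY (ts, action_type, source, target)
);

-- First attempt
INSERT INTO event VALUES ('2013-02-02 08:00:01', 1, 2, 3);

-- If that fails with a duplicate-key error, retry slightly later
INSERT INTO event VALUES ('2013-02-02 08:00:01' + INTERVAL 1 SECOND, 1, 2, 3);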
None of the three ideas you mention have anything to do with normalization. These are decisions about what to store; at the conceptual level, you normalize after you decide what to store. What the row means (so, what each column means) is significant; these meanings make up the table's predicate. The predicate lets you derive new true facts from older true facts.
Using an integer as a surrogate key, you're unlikely to exhaust the key space. But you still have to declare the natural key, so a surrogate in this case doesn't do anything useful for you.
Adding a "count" colummn makes sense if it makes sense to count things; otherwise it doesn't. Look at these two examples.
Timestamp             ActionType   Source   Target
--
2013-02-02 08:00:01   Wibble       SysA     SysB
2013-02-02 08:00:02   Wibble       SysA     SysB

Timestamp             ActionType   Source   Target   Count
--
2013-02-02 08:00:01   Wibble       SysA     SysB     2
What's the difference in meaning here? The meaning of "Timestamp" is particularly important. Normalization is based on semantics; what you need to do depends on what the data means, not on what the columns are named.
Breaking events into small groups might make sense (like adding a "count" column might make sense) if groups of events have meaning in your system.
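And if a count does turn out to be meaningful in your system, the upsert the question worries about is a single statement in MySQL (shown here only as an illustration; other DBMSs have their own equivalents):

CREATE TABLE event_counted (
    ts          DATETIME NOT NULL,
    action_type INT      NOT NULL,
    source      INT      NOT NULL,
    target      INT      NOT NULL,
    event_count SMALLINT UNSIGNED NOT NULL DEFAULT 1,
    PRIMARY KEY (ts, action_type, source, target)
);

-- Insert a new event, or bump the count if the identical event already exists
INSERT INTO event_counted (ts, action_type, source, target)
VALUES ('2013-02-02 08:00:01', 1, 2, 3)
ON DUPLICATE KEY UPDATE event_count = event_count + 1;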

How do I handle this scenario related to degenerate/fact dimension in SSAS cube?

I have an SQL Server 2008 SSIS/SSAS data warehouse cube that I am publishing. In this cube I have the following:
Dimensions
----------
Company
Product
Sales Person
Shipped Date (time dimension)
Facts
-----
Total Income
Total Revenue
Gross
For the above, I have setup primary (PK) / surrogate (SK) keys for the dimension/fact data referencing.
What I would also like to include are things such as Order Number or Transaction Number, which in my mind would fit in the fact table, as the order number is different for every record. Creating an order number dimension does not make much sense, as I would have as many order numbers as I have facts.
Right now, when I load my fact data I do multiple Lookups on the dimensions to get the surrogate keys, and I pass in the Order Number and Transaction Number varchar columns along with the fact data. But they cannot be used, as they are not something you can aggregate on, so they don't show up in SSAS; only columns of numeric data type do for the fact table (Total Income, Total Revenue, etc.).
Is there something I can do to make these available for anyone using the Cube to filter on?
Invoice number is a perfect candidate for a degenerate dimension.
It can be included in your fact table, yet not be linked to any dimension. These sorts of numbers are not useful in analytics except when you want to drill down, investigate, and trace a record back to your source system; they don't have any sensible "dimensionality". Kimball calls them degenerate dimensions. In SSAS they are called "fact dimensions":
http://msdn.microsoft.com/en-us/library/ms175669(v=sql.90).aspx
You are essentially putting an attribute column into the fact table, rather than a dimension table.
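In table terms it might look something like this (illustrative names; the measures and keys are taken from the question):

CREATE TABLE FactSales (
    CompanyKey        INT           NOT NULL,  -- SK into the Company dimension
    ProductKey        INT           NOT NULL,  -- SK into the Product dimension
    SalesPersonKey    INT           NOT NULL,  -- SK into the Sales Person dimension
    ShippedDateKey    INT           NOT NULL,  -- SK into the Shipped Date dimension
    OrderNumber       VARCHAR(20)   NOT NULL,  -- degenerate dimension, no dimension table
    TransactionNumber VARCHAR(20)   NOT NULL,  -- degenerate dimension
    TotalIncome       DECIMAL(18,2) NOT NULL,
    TotalRevenue      DECIMAL(18,2) NOT NULL,
    Gross             DECIMAL(18,2) NOT NULL
);

In SSAS you would then expose OrderNumber and TransactionNumber through a fact dimension so cube users can filter and drill down on them.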
One important tip: in dimensional modelling, yes, you are trying to build a star schema with perfectly formed dimensions, but don't be afraid to ignore the ideal when it comes to practical implementation. Kimball even says this: sometimes you need to break the rules, with the caveat that you test your solution. If it's quick, then do it! If conforming to the Kimball ideal makes it slower or adds unnecessary complexity, avoid it.

How many columns in table to keep? - MySQL

I am stuck between a row- vs. column-based table design for storing some items. The decision comes down to which table is easier to manage, and if columns, how many columns are best to have. For example, I have object metadata: ideally there are 45 pieces of information (after normalization) on the same level that I need to store per object. So are 45 columns in a heavy read/write table good? Can it work flawlessly in a real-world situation of heavy concurrent reads/writes?
If all or most of your columns are filled with data and this number is fixed, then just use 45 fields. There's nothing inherently bad about 45 columns.
If all of the following conditions are met:
The attributes are neither known nor predictable at design time
The attributes are only occasionally filled (say, 10 or less per entity)
There are many possible attributes (hundreds or more)
No attribute is filled for most entities
then you have a so-called sparse matrix. This (and only this) model is better represented with an EAV table.
As for raw column counts, the MySQL manual states that "There is a hard limit of 4096 columns per table", so 45 should be just fine.
Taking the "easier to manage" part of the question:
If the property names you are collecting do not change, then columns are just fine. Even if the table is sparsely populated, disk space is cheap.
However, if you have up to 45 properties per item (row) but those properties might be radically different from one element to another, then using rows is better.
For example, take a product catalog. One product might have color, weight, and height. Another might have a number of buttons or handles. These are obviously radically different properties. Further, this type of data suggests that new properties will be added that might only be relevant to a particular set of products. In this case, rows are much better.
Another option is to go NoSQL and use a document-based database server. This would allow you to set the named "columns" on a per-item basis.
All of that said, management of rows will be done by the application, which will require some advanced DB skills. Management of columns is done by the developer at design time, which is usually easier for most people to get their minds around.
I don't know if I'm correct, but I once read that MySQL recommends keeping your tables to as few columns as possible (see: http://dev.mysql.com/doc/refman/5.0/en/data-size.html). NOTE: this is if you are using MySQL; I don't know whether the same advice applies to other DBMSs like Oracle, Firebird, PostgreSQL, etc.
You could take a look at your 45-column table, analyze what you truly need, and move the optional fields into another table.
Hope it helps, good luck

Datatype for unit of measurement in database

For my application I need to keep the preferred unit of measurement of a user.
The possible units currently are:
Liter (the unit the values in the rest of my database are stored in)
Kilogram (varies with the density of the products)
US Liquid Gallon (3.785411784 litres)
US Liquid Quart (1/4th of above)
UK Liquid Gallon (4.54609 litres)
UK Liquid Quart (1/4th of above)
I need a way to save these units in an MSSQL 2005 (and up) database so that there can be no ambiguity, preferably without all the applications keeping an enumeration and without having to create an extra table.
Using an ISO abbreviation would work for the first two, but AFAIK there is none for the last four.
Using the string representation is also asking for trouble.
So, besides finally getting through to the project manager about not using such unwieldy units of measurement, what other suggestions do you have?
I know you don't want to create a new table, but in all honesty, it's the Right Thing™ to do. Add a column with a foreign key reference, and just do it - it'll work better in the end!
I think you need to reconsider your decision not to use a table to store these values. The main reason is that you will want to convert from one unit of measure to another, and you need to decide on the number of significant digits that matters to your application.
If you have a table, then you can store the litre-to-X conversion value in the record. This will help keep all of the other applications in sync and reduce rounding and comparison problems.
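If you do go that route, a sketch of such a table, using the litre factors from the question (names are just an example; kilogram gets NULL since it depends on density):

CREATE TABLE unit_of_measure (
    uom_code        VARCHAR(10)   PRIMARY KEY,  -- stored per user as a foreign key
    uom_name        VARCHAR(50)   NOT NULL,
    litres_per_unit DECIMAL(18,9) NULL          -- NULL where no fixed conversion exists
);

INSERT INTO unit_of_measure VALUES ('L',     'Liter',            1.0);
INSERT INTO unit_of_measure VALUES ('KG',    'Kilogram',         NULL);
INSERT INTO unit_of_measure VALUES ('USGAL', 'US Liquid Gallon', 3.785411784);
INSERT INTO unit_of_measure VALUES ('USQT',  'US Liquid Quart',  0.946352946);
INSERT INTO unit_of_measure VALUES ('UKGAL', 'UK Liquid Gallon', 4.54609);
INSERT INTO unit_of_measure VALUES ('UKQT',  'UK Liquid Quart',  1.1365225);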