I have a 'users' table which holds concrete, "sure" properties of my users: every one of them must be present and its veracity is certain. I also have a separate table, 'users_derived', in which all data is derived properties of my users, guessed by machine learning models. For example, 'age' might be a certain property since the user supplied it to me, while 'height' or 'hair color' might be a derived property since an ML model guessed it from a picture. The main difference is that all properties in the 'users' table were given to me by the users themselves and have complete certainty, whereas every property in the 'users_derived' table was guessed by my system and has both a value and a certainty associated with it. The other difference is that every property in the 'users' table will be there for every user, while any property in the 'users_derived' table may or may not be there. From time to time I also add new ML models which guess at more properties of users.
My question is how to do the schema for the 'users_derived' table. I could do it like this:
userid | prop1 | certainty1 | prop2 | certainty2 | prop3  | etc ...
123    | 7     | 0.57       | 5'8'' | 0.82       | red    |
124    | 12    | 0.6        | NULL  | NULL       | black  |
125    | NULL  | NULL       | 6'1'' | 0.88       | blonde |
or I could do it like this with slightly different indexing:
userid | property | value | certainty
123    | 1        | 7     | 0.57
123    | 2        | 5'8'' | 0.82
124    | 1        | 12    | 0.60
123    | 3        | red   | 0.67
124    | 3        | black | 0.61
125    | 2        | 6'1'' | 0.88
etc ....
So the tradeoffs seem to be that the second way isn't as normalized and might be slightly harder to query, but you don't have to know all the properties you care about in advance: if I want to add a new property, there is no schema change. Also, there don't have to be any NULL spots, since if we don't have a property yet we just don't have a row for it. What am I missing? What are the benefits of the first way? Are there queries I can do against the first schema that are hard or impossible in the second schema? Does the second way somehow need more space for indexing to make it fast?
The second way is more normalized. Both the table and the indexes are likely to be more compact, especially if the first form is relatively sparsely populated. Although the two forms have different tradeoffs for different queries, in general the second form is more flexible and better suited to a wide variety of queries. If you want to transform data from the normalized form to the crosstabbed form, there is a crosstab function in Postgres' tablefunc extension that can be used for this purpose. Normalizing crosstabbed data will be more difficult, especially if the number of columns is indeterminate, yet you may need to do that for some types of queries.
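A minimal sketch of the normalized-to-crosstab transformation, under some assumptions: it uses SQLite through Python's sqlite3 module rather than Postgres, and it pivots with conditional aggregation (MAX over CASE) instead of the tablefunc crosstab function. The output column names prop1..prop3 are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users_derived ("
    " userid INTEGER, property INTEGER, value TEXT, certainty REAL,"
    " PRIMARY KEY (userid, property))"
)
# The rows listed in the question's second layout.
conn.executemany(
    "INSERT INTO users_derived VALUES (?, ?, ?, ?)",
    [
        (123, 1, "7", 0.57),
        (123, 2, "5'8''", 0.82),
        (124, 1, "12", 0.60),
        (123, 3, "red", 0.67),
        (124, 3, "black", 0.61),
        (125, 2, "6'1''", 0.88),
    ],
)


def pivot(conn):
    """Return one wide row per user; properties without a row come back as None."""
    return conn.execute(
        """
        SELECT userid,
               MAX(CASE WHEN property = 1 THEN value END) AS prop1,
               MAX(CASE WHEN property = 2 THEN value END) AS prop2,
               MAX(CASE WHEN property = 3 THEN value END) AS prop3
        FROM users_derived
        GROUP BY userid
        ORDER BY userid
        """
    ).fetchall()


wide_rows = pivot(conn)
```

Note that the pivot query has to name every property it wants as a column, which is the "know your properties in advance" cost moving from the schema into the query.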
Let's say I want to create a table like this:
id | some_foreign_id | attribute | value
_________________________________________
1  | 1               | Weight    | 100
2  | 1               | Reps      | 5
3  | 2               | Reps      | 40
4  | 3               | Time      | 10
5  | 4               | Weight    | 50
6  | 4               | Reps      | 60
Versus the same data represented this way
id | some_foreign_id | weight | reps | time
____________________________________________
1  | 1               | 100    | 5    | NULL
2  | 2               | NULL   | 40   | NULL
3  | 3               | NULL   | NULL | 10
4  | 4               | 50     | 60   | NULL
And since in this case the id = foreign_id I think we can just append these columns to whatever table foreign_id is referring to.
I would assume most people would overwhelmingly say the latter approach is the accepted practice.
Is the former approach considered a bad idea, even though it doesn't result in any NULLs? What are the tradeoffs between these two approaches exactly? It seems like the former might be more versatile, at the expense of not really having a clear defined structure, but I don't know if this would actually result in other ramifications. I can imagine a situation where you have tons of columns in the latter example, most of which are NULL, and maybe only like three distinct values filled in.
Your first example is the EAV (Entity-Attribute-Value) model. It has a few advantages; however, you are on MySQL, and MySQL doesn't handle this the best. As pointed out in the thread "Crosstab View in mySQL?", MySQL lacks functions that other databases have. Postgres and other databases have some more fun functions (see "PostgreSQL Crosstab Query") that make this significantly easier. In the MSSQL world, this gets referred to as sparsely populated columns. I find columnar stores actually lend themselves quite well to this (Vertica, or high-end Oracle).
Advantages:
Adding a new attribute to this is significantly easier than altering a table schema. If you are unsure what future column names will be, this is the way to go.
Sparsely populated columns result in tables full of NULLs and redundant data; EAV avoids this because a missing attribute is simply a missing row. You can set up logic to supply a 'default' value for an attribute, i.e. if no value is specified for this attribute, then use this value.
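The 'default value' logic mentioned above could look roughly like this. It is only a sketch: the table names attribute and entity_value, the default_value column, and the defaults themselves ('0' and '1') are all made up for illustration, and it runs on SQLite via Python's sqlite3 rather than MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical catalogue of attributes, each with a fallback value.
    CREATE TABLE attribute (
        name          TEXT PRIMARY KEY,
        default_value TEXT
    );
    -- The EAV table itself: one row per (entity, attribute) that has a value.
    CREATE TABLE entity_value (
        entity_id INTEGER NOT NULL,
        name      TEXT NOT NULL REFERENCES attribute(name),
        value     TEXT NOT NULL,
        PRIMARY KEY (entity_id, name)
    );
    INSERT INTO attribute VALUES ('Weight', '0'), ('Reps', '1');
    INSERT INTO entity_value VALUES (1, 'Weight', '100'), (1, 'Reps', '5'), (2, 'Reps', '40');
    """
)


def get_value(conn, entity_id, name):
    """Return the stored value, else the attribute's default, else None."""
    row = conn.execute(
        """
        SELECT COALESCE(v.value, a.default_value)
        FROM attribute AS a
        LEFT JOIN entity_value AS v
               ON v.name = a.name AND v.entity_id = ?
        WHERE a.name = ?
        """,
        (entity_id, name),
    ).fetchone()
    return row[0] if row else None
```

Entity 2 has no Weight row, so get_value(conn, 2, 'Weight') falls back to the attribute's default instead of returning a NULL.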
Downsides:
A bit harder to program with, in MySQL in particular, as per the comments above. Not all SQL devs are familiar with the model, and you might accidentally create a steeper learning curve for new resources.
Not the most scalable. Indexing is a challenge and you need workarounds (Strawberry's input in the comments is along these lines: your value column is basically forced to VARCHAR, and that does not index well, nor does it search easily... welcome to table scan hell). You can get around this with a third table (say you query on dates like create date and close date a lot: create a third 'control' table that contains those frequently queried columns, index that, and refer to the EAV tables from there) or by creating multiple EAV tables, one for each data type.
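As a rough illustration of the "one EAV table per data type" workaround, here is a sketch assuming SQLite through Python's sqlite3. The names eav_number, eav_text and the index name are invented for the example; the data is the Weight/Reps/Time rows from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Numeric attributes get a real numeric column that can be indexed and range-scanned.
    CREATE TABLE eav_number (
        entity_id INTEGER NOT NULL,
        attribute TEXT    NOT NULL,
        value     REAL    NOT NULL,
        PRIMARY KEY (entity_id, attribute)
    );
    -- Free-text attributes live in their own table (unused below, shown for shape only).
    CREATE TABLE eav_text (
        entity_id INTEGER NOT NULL,
        attribute TEXT    NOT NULL,
        value     TEXT    NOT NULL,
        PRIMARY KEY (entity_id, attribute)
    );
    CREATE INDEX idx_eav_number_attr_value ON eav_number (attribute, value);
    INSERT INTO eav_number VALUES
        (1, 'Weight', 100), (1, 'Reps', 5), (2, 'Reps', 40),
        (3, 'Time', 10), (4, 'Weight', 50), (4, 'Reps', 60);
    """
)


def entities_with_number_between(conn, attribute, low, high):
    """Entities whose numeric attribute lies in [low, high]."""
    rows = conn.execute(
        "SELECT entity_id FROM eav_number"
        " WHERE attribute = ? AND value BETWEEN ? AND ?"
        " ORDER BY entity_id",
        (attribute, low, high),
    ).fetchall()
    return [r[0] for r in rows]


# Why a single text value column hurts: text compares character by character.
text_compare = conn.execute("SELECT '5' BETWEEN '10' AND '60'").fetchone()[0]
number_compare = conn.execute("SELECT 5 BETWEEN 10 AND 60").fetchone()[0]
```

With a text-only value column, '5' sorts between '10' and '60', so range filters silently return wrong rows unless every query casts, and a cast usually defeats the index.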
The first one is the right one.
If you later want to change the number of properties, you don't have to change your DB structure.
Changing the DB structure can cause your app to break.
If the number of NULLs is too big, you are wasting a lot of storage.
My take on this
The first I would probably use if I have a lot of different attributes and values that I would like to add in a more dynamic way, like user tags or user-specific information.
The second one I would probably use if I just have the three attributes (as in your example), weight, reps and time, and have no need for anything dynamic or for adding many more attributes (if I did need one more, I would just add another column).
I would say both work; it is, as you yourself say, that "the former might be more versatile". Both ways need their own structure around them to extract, process and store data :)
Edit: for the first one to achieve the structure of the second one, you would have to add a join for each attribute you would want to include in the data extract.
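To make the "one join per attribute" point concrete, here is a small sketch using SQLite via Python's sqlite3. The table name measurements is made up (the question does not name the table); the rows are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE measurements (
        id              INTEGER PRIMARY KEY,
        some_foreign_id INTEGER NOT NULL,
        attribute       TEXT    NOT NULL,
        value           INTEGER NOT NULL
    );
    INSERT INTO measurements VALUES
        (1, 1, 'Weight', 100), (2, 1, 'Reps', 5), (3, 2, 'Reps', 40),
        (4, 3, 'Time', 10), (5, 4, 'Weight', 50), (6, 4, 'Reps', 60);
    """
)


def wide_rows(conn):
    """Rebuild (some_foreign_id, weight, reps, time) with one LEFT JOIN per attribute."""
    return conn.execute(
        """
        SELECT e.some_foreign_id, w.value AS weight, r.value AS reps, t.value AS time
        FROM (SELECT DISTINCT some_foreign_id FROM measurements) AS e
        LEFT JOIN measurements AS w
               ON w.some_foreign_id = e.some_foreign_id AND w.attribute = 'Weight'
        LEFT JOIN measurements AS r
               ON r.some_foreign_id = e.some_foreign_id AND r.attribute = 'Reps'
        LEFT JOIN measurements AS t
               ON t.some_foreign_id = e.some_foreign_id AND t.attribute = 'Time'
        ORDER BY e.some_foreign_id
        """
    ).fetchall()
```

Every additional attribute in the extract means another LEFT JOIN, so the query grows with the number of attributes even though the schema does not.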
I think the first way contributes better towards normalization. You could even create a new table with attributes:
id | attribute
______________
1  | reps
2  | weight
3  | time
And turn the attribute column (the second-to-last one) into a foreign key. This will save space and will spare you the risk of mistyping the attribute names. Like this:
id | some_foreign_id | attribute | value
_________________________________________
1  | 1               | 2         | 100
2  | 1               | 1         | 5
3  | 2               | 1         | 40
4  | 3               | 3         | 10
5  | 4               | 2         | 50
6  | 4               | 1         | 60
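A sketch of the lookup-table variant, assuming SQLite via Python's sqlite3 (which only enforces foreign keys after PRAGMA foreign_keys = ON). The table names attributes and measurements are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite ignores REFERENCES unless this is set
conn.executescript(
    """
    CREATE TABLE attributes (
        id        INTEGER PRIMARY KEY,
        attribute TEXT NOT NULL UNIQUE
    );
    CREATE TABLE measurements (
        id              INTEGER PRIMARY KEY,
        some_foreign_id INTEGER NOT NULL,
        attribute       INTEGER NOT NULL REFERENCES attributes(id),
        value           INTEGER NOT NULL
    );
    INSERT INTO attributes VALUES (1, 'reps'), (2, 'weight'), (3, 'time');
    INSERT INTO measurements VALUES
        (1, 1, 2, 100), (2, 1, 1, 5), (3, 2, 1, 40),
        (4, 3, 3, 10), (5, 4, 2, 50), (6, 4, 1, 60);
    """
)


def try_insert(conn, some_foreign_id, attribute_id, value):
    """Insert a measurement; return False if the attribute id is unknown."""
    try:
        conn.execute(
            "INSERT INTO measurements (some_foreign_id, attribute, value)"
            " VALUES (?, ?, ?)",
            (some_foreign_id, attribute_id, value),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

An insert that points at an attribute id missing from the lookup table is rejected by the database, which is the protection against mistyped attribute names.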
As others have stated, the first way is the better way. Why? Well, it normalizes the structure. Reference: https://en.wikipedia.org/wiki/Database_normalization
As that article states, normalization reduces database size and allows for easy expansion.
I'm creating a MySQL database with tables that contain information about different types of products.
As an example, let's say Table1 contains bicycles and Table2 contains t-shirts.
I want to be able to store information about things like which colors each of the items in each table are.
For example, there might be a bicycle in Table1 that's blue and yellow, and a t-shirt in Table2 that's red, green, and orange.
Originally I had intended to store color information as binary numbers in each table and use bit masking to figure out the colors of a particular object (i.e. 1 = red, 2 = blue, 4 = green, 8 = orange - if the value is 5, the object is Red and Green). I was going to have a foreign key table with the values for all the single colors (i.e. Red = 1, Green = 4) and use sums of the values from that table as bit masks.
I assumed doing it this way would be "faster", but I've been "Googling" this subject for weeks before making a decision and found out that it's "faster" to have a foreign key table so indexes can be used. (i.e., if you wanted to see if a t-shirt with the color value set to 13 included the colors Red and Green, rather than doing "13 & 5" , you would check row 13 in the foreign key table to see if the values for Red and Green were set to 1.)
The thing is, the list of colors I'm using is currently at 26, and I'm anticipating that it will grow. (I was trying not to go over 31 colors so I could use an INT column to store the values, where 0 = "none".) If I were to make a foreign key table to cover all possible combinations of 31 colors, it would have to have 2,147,483,647 rows and 32 columns (one true/false column for each possible color). Every time another color was added, I would have to double the number of rows in the table (like, one additional color would require 2147483648 more rows).
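The bit-mask arithmetic and the row counts above can be checked with a few lines of Python. The color values are the ones from the question (1 = red, 2 = blue, 4 = green, 8 = orange); the helper names are made up for the example.

```python
# Bit values taken from the question's example.
COLORS = {"red": 1, "blue": 2, "green": 4, "orange": 8}


def encode(names):
    """Combine color names into a single integer mask."""
    mask = 0
    for name in names:
        mask |= COLORS[name]
    return mask


def decode(mask):
    """List the color names whose bit is set in the mask."""
    return [name for name, bit in COLORS.items() if mask & bit]


def has_all(mask, names):
    """True if every named color is present in the mask."""
    wanted = encode(names)
    return (mask & wanted) == wanted


def combination_rows(num_colors):
    """Rows needed to list every non-empty combination of num_colors colors."""
    return 2 ** num_colors - 1
```

For instance, decode(13) gives red, green and orange, has_all(13, ["red", "green"]) is the "13 & 5" check, and combination_rows(32) - combination_rows(31) is the 2147483648 extra rows a 32nd color would require.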
I assume it would be preferable to make a "junction table" like this:
+----------+------------+
| shirt_id | color_id |
+----------+------------+
| 1 | 1 (Red) |
| 1 | 4 (Green) |
| 1 | 8 (Orange) |
| 2 | 2 (Blue) |
| 2 | 4 (Green) |
+----------+------------+
Then there wouldn't need to be a gigantic table listing every possible combination (the vast majority of which might never be used). The thing is, there would have to be junction table for every product type, and there are going to be a large number of product types, meaning a large number of junction tables.
I'm using colors as an example, but I had actually planned to do this for several other "stackable" values as well (for example, a single object could be composed of hardwood and aluminum and glass and particle board and ABS plastic and PVC and cardboard...and so on, all at the same time).
My question is, what is the most efficient method of handling situations like this? Is there a method I haven't thought of that's preferred over these?
I'm only using colors as an example - the database will actually have a number of "stackable" attributes like this (things like material, fiber type, texture, finish, etc.) that can apply to more than one product type, and the "products" themselves will be "generic" and have a "stackable" value that indicates the types of components that make them up (like, a "product" that includes a bicycle and a t-shirt packaged together).
Having written this, I imagine using multiple junction tables would be the most efficient way to do it. But as an "old-school programmer", it's difficult for me to get my head around the idea that making [for example] 30 different junction tables just for product component/color combinations alone could possibly be "preferable" to just directly analyzing bits in a binary value. (I do realize MySQL is not a Nintendo Entertainment System...)
I once implemented bit-masking on a field for different domains. However, that was clearly a case where it was going to provide a big performance improvement, as it avoided having to join 8~10 tables. Bit-masking is extremely fast, especially if the field is indexed.
With an index on a 32-bit field, it will do at most 31 comparisons to find the resulting rows.
Without the index it would still have to perform the bit-compare on every row.
However, there is a big 'if': it's not easy to maintain, and the shirt colors will always be limited to the bit length. In the case you describe I would really opt for the junction table and just make sure to have an index on your foreign keys.
The question of performance depends on the queries being used, as well as the structure of the data. Your question doesn't include information on the queries.
But, there seems little reason not to use a junction table. This would involve a table called Colors with an auto-incremented primary ColorId. Then for each table that required colors, you would have a table, such as BikeColors with one row per bike and color.
I wouldn't attempt to do this using bit-fiddling, unless you have a really good reason to. That is, unless you have tried a junction table, and for some reason that doesn't meet your needs. A junction table can take advantage of indexes. Bit fiddling generally does not.
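A minimal sketch of the junction-table layout described above, using SQLite via Python's sqlite3 instead of MySQL. Colors, ColorId and BikeColors are the names from the answer; the Bikes table, the sample bikes, and the index name are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Colors (
        ColorId INTEGER PRIMARY KEY AUTOINCREMENT,
        Name    TEXT NOT NULL UNIQUE
    );
    CREATE TABLE Bikes (
        BikeId INTEGER PRIMARY KEY,
        Name   TEXT NOT NULL
    );
    -- One row per bike and color.
    CREATE TABLE BikeColors (
        BikeId  INTEGER NOT NULL REFERENCES Bikes(BikeId),
        ColorId INTEGER NOT NULL REFERENCES Colors(ColorId),
        PRIMARY KEY (BikeId, ColorId)
    );
    -- Lets "which bikes have color X" use an index as well.
    CREATE INDEX idx_bikecolors_color ON BikeColors (ColorId, BikeId);
    INSERT INTO Colors (Name) VALUES ('Red'), ('Blue'), ('Green'), ('Orange'), ('Yellow');
    INSERT INTO Bikes VALUES (1, 'commuter'), (2, 'racer');
    """
)


def paint(conn, bike_id, color_name):
    """Attach a color to a bike by color name."""
    conn.execute(
        "INSERT INTO BikeColors (BikeId, ColorId)"
        " SELECT ?, ColorId FROM Colors WHERE Name = ?",
        (bike_id, color_name),
    )


def bikes_with_all_colors(conn, color_names):
    """Names of bikes that have every one of the given colors."""
    names = sorted(set(color_names))
    placeholders = ",".join("?" for _ in names)
    rows = conn.execute(
        f"""
        SELECT b.Name
        FROM Bikes AS b
        JOIN BikeColors AS bc ON bc.BikeId = b.BikeId
        JOIN Colors AS c ON c.ColorId = bc.ColorId
        WHERE c.Name IN ({placeholders})
        GROUP BY b.BikeId, b.Name
        HAVING COUNT(DISTINCT c.ColorId) = ?
        ORDER BY b.Name
        """,
        [*names, len(names)],
    ).fetchall()
    return [r[0] for r in rows]


for color in ("Blue", "Yellow"):
    paint(conn, 1, color)
for color in ("Red", "Green", "Orange"):
    paint(conn, 2, color)
```

Adding a 27th or 50th color is one more row in Colors; nothing about the table shapes or the query changes.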
Also, I would question why you have separate tables for bikes and T-shirts, unless you have a lot of columns that differ between them. For most retailing purposes, one table would be sufficient for multiple products.
I have a database for a device and the columns are like this:
DeviceID | DeviceParameter1 | DeviceParameter2
At this stage I need only these parameters, but maybe a few months down the line, I may need a few more devices which have more parameters, so I'll have to add DeviceParameter3 etc as columns.
A friend suggested that I keep the parameters as rows in another table (ParamCol) like this:
Column | ColumnNumber
---------------------------------
DeviceParameter1 | 1
DeviceParameter2 | 2
DeviceParameter3 | 3
and then refer to the columns like this:
DeviceID | ColumnNumber <- this is from the ParamCol table
---------------------------------------------------
switchA | 1
switchA | 2
routerB | 1
routerB | 2
routerC | 3
He says that for 3NF, when we expect a table whose columns may increase dynamically, it's better to keep the columns as rows. I don't believe him.
In your opinion, is this really the best way to handle a situation where the columns may increase or is there a better way to design a database for such a situation?
This is a "generic data model" question - if you google the term you'll find quite a bit of material on the net.
Here is my view: if and only if the parameters are NOT qualitatively different from the application perspective, then go with the dynamic row solution (i.e. a generic data model). What does qualitatively mean - it means that within your application you don't treat Parameter3 any different to Parameter17.
You should never ever generate new columns on-the-fly, that's a very bad idea. If the columns are qualitatively different and you want to be able to cater for new ones, then you could have a different Device Parameter table for each different category of parameters. The idea is to avoid dynamic SQL as much as possible as it brings a set of its own problems.
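A small sketch of the row-based (generic data model) layout, assuming SQLite via Python's sqlite3. The question's tables only list which parameters a device has, so the value column and the sample values here are hypothetical, as are the table names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE parameter (
        parameter_id INTEGER PRIMARY KEY,
        name         TEXT NOT NULL UNIQUE
    );
    -- One row per (device, parameter); the value column is an assumption.
    CREATE TABLE device_parameter (
        device_id    TEXT    NOT NULL,
        parameter_id INTEGER NOT NULL REFERENCES parameter(parameter_id),
        value        TEXT,
        PRIMARY KEY (device_id, parameter_id)
    );
    INSERT INTO parameter VALUES
        (1, 'DeviceParameter1'), (2, 'DeviceParameter2'), (3, 'DeviceParameter3');
    INSERT INTO device_parameter VALUES
        ('switchA', 1, 'a1'), ('switchA', 2, 'a2'),
        ('routerB', 1, 'b1'), ('routerB', 2, 'b2'),
        ('routerC', 3, 'c3');
    """
)


def add_parameter(conn, name):
    """Register a new kind of parameter: a plain INSERT, no ALTER TABLE."""
    cur = conn.execute("INSERT INTO parameter (name) VALUES (?)", (name,))
    return cur.lastrowid


def set_value(conn, device_id, parameter_id, value):
    """Create or overwrite one parameter value for a device."""
    conn.execute(
        "INSERT OR REPLACE INTO device_parameter VALUES (?, ?, ?)",
        (device_id, parameter_id, value),
    )


def parameters_of(conn, device_id):
    """Return {parameter name: value} for one device."""
    rows = conn.execute(
        """
        SELECT p.name, dp.value
        FROM device_parameter AS dp
        JOIN parameter AS p ON p.parameter_id = dp.parameter_id
        WHERE dp.device_id = ?
        ORDER BY p.parameter_id
        """,
        (device_id,),
    ).fetchall()
    return dict(rows)
```

This only pays off when the application treats all parameters alike, as the answer says; the code above never needs to know a parameter's name to store or list it.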
Adding columns dynamically is a bad idea; actually, it's bad design. I would agree with your second option: adding rows is OK.
If you grow the columns dynamically, you have to provide a default value for each of them, you will not be able to use them as 'UNIQUE' values, and you will find updating the tables really hard. So it's better to stick with the 'rows' plan.
I'm working on a project. It's mostly for learning purposes; I find that actually trying a complicated project is the best way to learn a language after grasping the basics. Database design is not a strong point of mine. I started reading up on it, but it's early days and I'm still learning.
Here is my alpha schema. I'm really at the point where I'm just trying to jot down everything I can think of and seeing if any issues jump out.
http://diagrams.seaquail.net/Diagram.aspx?ID=10094#
Some of my concerns I would like feedback on:
Notice the core attributes, like area for example. Let's say for simplicity the areas are kitchen, bedroom, garden, bathroom and living room. For another customer they might be homepage, contact page, about_us and splash screen. It could be 2 areas or it could be 100; there isn't a need to limit it.
I created separate tables for the defaults, and each is linked to a bug. Later I came to the problem of custom fields: if someone wants, for example, to mark which theme the bug applies to, we don't have that. There are probably 100 other things like it, so I wanted to stick to a core set of attributes and let the custom fields give people flexibility.
However, when I got to the custom fields I knew I had an issue: I can't be creating a table for every custom field, so I instead used 2 tables, custom_fields and custom_field_values. The idea is that every field, including the defaults, would be stored in this table, and each would be linked to the values table, which would just have something like this:
custom_fields table
id | project_id | name
01 | 1          | area(default)
12 | 2          | rooms(custom)
13 | 4          | website(custom)
custom_field_values table
id  | area        | project_id | sort_number
667 | area1       | 1          | 1
668 | area2       | 1          | 2
669 | area3       | 1          | 3
670 | area4       | 1          | 4
671 | bedroom     | 2          | 1
672 | bathroom    | 2          | 2
673 | garden      | 2          | 3
674 | livingroom  | 2          | 4
675 | homepage    | 4          | 1
676 | about_us    | 4          | 2
677 | contact     | 4          | 3
678 | splash page | 4          | 4
Does this look like an efficient way to handle dynamic fields like this, or are there other alternatives?
The defaults would be hard-coded, so you can either use them or replace them with your own, or I could create another table to allow users to edit the names of the defaults, which would be linked to their project. Any feedback is welcome, and if there is something very obviously wrong with the schema, please feel free to critique.
You have reinvented an old antipattern called Entity-Attribute-Value. The idea of custom fields in a table is really logically incompatible with a relational database. A relation has a fixed number of fields.
But even though it isn't properly relational, we still need to do it sometimes.
There are a few methods to mimic custom fields in SQL, though most of them break rules of normalization. For some examples, see:
Product table, many kinds of product, each product has many parameters on StackOverflow
My presentation Extensible Data Modeling with MySQL
My book SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming
I found this as I was searching for something similar, since the customer can submit custom fields for use later.
I settled on using the JSON data type, which I appreciate was not available when this question was asked.
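For what the JSON approach might look like, here is a rough sketch. It does not use a native JSON column type: it stores JSON text in a SQLite TEXT column through Python's sqlite3 and json modules, which is a stand-in for the database's own JSON type. The bug table, its columns, and the field names are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE bug (
        id            INTEGER PRIMARY KEY,
        project_id    INTEGER NOT NULL,
        title         TEXT    NOT NULL,
        custom_fields TEXT    NOT NULL DEFAULT '{}'  -- JSON document as text
    )
    """
)


def add_bug(conn, project_id, title, custom_fields=None):
    """Store a bug together with whatever custom fields the project uses."""
    cur = conn.execute(
        "INSERT INTO bug (project_id, title, custom_fields) VALUES (?, ?, ?)",
        (project_id, title, json.dumps(custom_fields or {})),
    )
    return cur.lastrowid


def get_custom_fields(conn, bug_id):
    """Return the bug's custom fields as a dict, or None if the bug is unknown."""
    row = conn.execute(
        "SELECT custom_fields FROM bug WHERE id = ?", (bug_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Each project can then carry its own keys (for example a "rooms" or "theme" field) without new tables or columns, at the cost of the database knowing nothing about those keys unless its JSON functions are used to query them.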
I have a table with this structure:
col1 would be "product_name" and col2 "product_name_abbreviated".
Ignoring the id column, I have this data:
1 1 43
1 1 5
1 1 6
1 1 7
1 1 8
2 2 9
2 2 10
2 2 34
2 2 37
2 2 38
2 2 39
2 2 50
I could make another table and put the col1 and col2 columns there, because they are repeated. Something like this:
But I'm sure that they'll not be repeated more than 15 times, so... is it worth it?
Thanks in advance.
Yes, you should split them out into separate tables - this is an example of normalisation to Second Normal Form.
You are sure NOW, but what about when you extend your application in a year's time? Split the tables.
Use only one table with the ID, two VARCHAR columns for the name and abbreviation and a NUMBER for the price.
Normalization is good for avoiding repeated data. Your model is tiny and the data is small, so you should not worry: leave it as one entity (table).
In real projects we sometimes normalize and then realize we've got a mess. It's always good to strike a balance between repeating data and ease of understanding and querying the model. Not to mention when working with data warehouse databases...
This is a very basic question in database design and the answer is a resounding "Two Tables"!
Here are just some of the reasons:
If you have one table, then by mistake someone could enter a new row with product name "1" and abbreviated product name "2". The only way to stop this would be to add rules and constraints - far more complicated than just splitting the tables in the first place.
Looking at the database schema should tell you meaningfully about what it represents. If it's a FACT that you can't have a product with product name "1" and abbreviated product name "2" then this should be clear from looking at the table structure. A single table tells you the opposite, which is UNTRUE. A database should tell the truth - otherwise it is misleading.
If anyone other than yourself looks at or develops against this database, they may be confused and misled by this deviation from such basic rules of design. Or worse, it could lead to broken window syndrome, if they assume it was not carefully designed and therefore don't take care with their own work.
The principle is called "Normalisation" and is at the heart of what it means for something to be a relational database rather than just some data in a pile :)
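A sketch of the two-table split, assuming SQLite via Python's sqlite3. The question never names its tables or its third column, so product, product_value and value are placeholders; the rows are the ones listed in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces REFERENCES only when enabled
conn.executescript(
    """
    -- Each product name is stored once, together with its single abbreviation.
    CREATE TABLE product (
        product_id               INTEGER PRIMARY KEY,
        product_name             TEXT NOT NULL UNIQUE,
        product_name_abbreviated TEXT NOT NULL
    );
    -- The repeating rows only carry a reference to the product.
    CREATE TABLE product_value (
        id         INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL REFERENCES product(product_id),
        value      INTEGER NOT NULL
    );
    INSERT INTO product VALUES (1, '1', '1'), (2, '2', '2');
    """
)
conn.executemany(
    "INSERT INTO product_value (product_id, value) VALUES (?, ?)",
    [(1, v) for v in (43, 5, 6, 7, 8)] + [(2, v) for v in (9, 10, 34, 37, 38, 39, 50)],
)


def try_add_product(conn, name, abbreviation):
    """Return False when the name already exists, whatever abbreviation is offered."""
    try:
        conn.execute(
            "INSERT INTO product (product_name, product_name_abbreviated)"
            " VALUES (?, ?)",
            (name, abbreviation),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

Here the mismatched pair (product name "1" with abbreviation "2") cannot be entered at all, because the name exists once and the UNIQUE constraint rejects a second copy; in the single-table layout nothing would stop it.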