I have a table with this structure:
col1 would be "product_name" and col2 "product_name_abbreviated".
Ignoring the id column, I have this data:
1 1 43
1 1 5
1 1 6
1 1 7
1 1 8
2 2 9
2 2 10
2 2 34
2 2 37
2 2 38
2 2 39
2 2 50
I could create another table and move the col1 and col2 columns there, because they are repeated. Something like this:
But I'm sure the values will not be repeated more than 15 times, so... is it worth it?
Thanks in advance.
Yes, you should split them out into separate tables - this is an example of normalisation to Second Normal Form.
You are sure NOW, but what about when you extend your application in a year's time? Split the tables.
Use only one table with the ID, two VARCHAR columns for the name and abbreviation and a NUMBER for the price.
Normalization is good for avoiding repeating data. Your model is tiny, the data is small, you should not worry and leave one entity (table).
In real projects we sometimes normalize and then realize we have made a mess. It's always good to strike a balance between repeating data and ease of understanding and querying the model. Not to mention when working with data warehouse databases...
This is a very basic question in database design and the answer is a resounding "Two Tables"!
Here are just some of the reasons:
If you have one table, then someone could by mistake enter a new row with product name "1" and abbreviated product name "2". The only way to stop this would be to add rules and constraints, which is far more complicated than just splitting the tables in the first place.
Looking at the database schema should tell you meaningfully about what it represents. If it's a FACT that you can't have a product with product name "1" and abbreviated product name "2" then this should be clear from looking at the table structure. A single table tells you the opposite, which is UNTRUE. A database should tell the truth - otherwise it is misleading.
If anyone other than yourself looks at or develops against this database, they may be confused and misled by this deviation from such basic rules of design. Or worse, it could lead to broken window syndrome, if they assume it was not carefully designed and therefore don't take care with their own work.
The principle is called "Normalisation" and is at the heart of what it means for something to be a relational database rather than just some data in a pile :)
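Here is a minimal sketch of the two-table split, run through Python's built-in sqlite3 module so it can be executed as-is. The table and column names (products, product_values, value) are my own assumptions, as is treating the third column as a plain integer; the question does not say what it holds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked

conn.executescript("""
CREATE TABLE products (
    product_id               INTEGER PRIMARY KEY,
    product_name             TEXT NOT NULL UNIQUE,
    product_name_abbreviated TEXT NOT NULL
);
CREATE TABLE product_values (
    id         INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL REFERENCES products (product_id),
    value      INTEGER NOT NULL
);
""")

# Each name/abbreviation pair is stored exactly once.
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(1, "1", "1"), (2, "2", "2")],
)

# The repeated rows from the question now carry only a product_id.
values = [(1, v) for v in (43, 5, 6, 7, 8)]
values += [(2, v) for v in (9, 10, 34, 37, 38, 39, 50)]
conn.executemany(
    "INSERT INTO product_values (product_id, value) VALUES (?, ?)", values
)

# A row that points at a product which does not exist is refused.
try:
    conn.execute("INSERT INTO product_values (product_id, value) VALUES (3, 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Joining the two tables gives back the original flat shape.
rows = conn.execute("""
    SELECT p.product_name, p.product_name_abbreviated, v.value
    FROM product_values AS v
    JOIN products AS p ON p.product_id = v.product_id
    ORDER BY v.id
""").fetchall()
```

Because the abbreviation lives only in products, a value row has no way to pair product name "1" with abbreviation "2"; the schema itself rules it out.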
There are two tables - users and orders:
users:
id  first_name  orders_amount_total
1   Jone        5634200
2   Mike        3982830
orders:
id  user_id  order_amount
1   1        200
2   1        150
3   2        70
4   1        320
5   2        20
6   2        10
7   2        85
8   1        25
The tables are linked by user id. The task is to show each user the sum of all their orders. A user can have thousands, maybe tens of thousands, of orders, and hundreds or thousands of users may be making this request at the same time. There are two options:
With each new order, in addition to writing to the orders table, increase the orders_amount_total counter, and then simply show it to the user.
Remove the orders_amount_total field and compute the sum of a particular user's orders on demand, using a JOIN between the tables and the SUM aggregate.
Which option is better to use? Why? Why is the other option bad?
P.S. I believe that the second option is concise and correct, given that the database is relational, but I have strong doubts about the load on the server, because the set of rows read to calculate the amount is large even for one user, and there are many users.
Option 2. is the correct one for the vast majority of cases.
Option 1. would cause data redundancy that may lead to inconsistencies. With option 2. you're on the safe side to always get the right values.
Yes, denormalizing tables can improve performance. But that's a last resort and great care needs to be taken. "Tens of thousands" of rows isn't a particularly large set for an RDBMS. They are built to handle even millions and more pretty well. So you seem to be far away from the last resort and should go with option 2. and proper indexes.
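As a rough sketch of option 2. with an index on the join column, using the sample rows from the question. It runs on SQLite through Python's sqlite3 module purely so that it is self-contained; the index name idx_orders_user_id is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id         INTEGER PRIMARY KEY,
    first_name TEXT NOT NULL
);
CREATE TABLE orders (
    id           INTEGER PRIMARY KEY,
    user_id      INTEGER NOT NULL REFERENCES users (id),
    order_amount INTEGER NOT NULL
);
-- Lets the database find one user's orders without scanning the whole table.
CREATE INDEX idx_orders_user_id ON orders (user_id);
""")

conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Jone"), (2, "Mike")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 200), (2, 1, 150), (3, 2, 70), (4, 1, 320),
     (5, 2, 20), (6, 2, 10), (7, 2, 85), (8, 1, 25)],
)

# No stored counter: the total is derived from the orders every time.
# LEFT JOIN plus COALESCE keeps users without orders in the result with 0.
totals = conn.execute("""
    SELECT u.id, u.first_name,
           COALESCE(SUM(o.order_amount), 0) AS orders_amount_total
    FROM users AS u
    LEFT JOIN orders AS o ON o.user_id = u.id
    GROUP BY u.id, u.first_name
    ORDER BY u.id
""").fetchall()

print(totals)  # [(1, 'Jone', 695), (2, 'Mike', 185)]
```

For a single user you would add WHERE u.id = ? to the query, which is the case the index helps with most.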
I agree with @sticky_bit that Option 2. is better than 1. There's another possibility:
Create a VIEW that's a pre-defined invocation of the JOIN/SUM query. A smart DBMS should be able to infer that each time the orders table is updated, it also needs to adjust orders_amount_total for the user_id.
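A small sketch of the VIEW idea, with the view name user_order_totals being my assumption. Note that in SQLite (used here via Python's sqlite3 module only to keep the example runnable) a view is not stored: it is simply re-evaluated on each query. Whether a given DBMS maintains such a result incrementally depends on the product, so treat that part as something to check for your own database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    id           INTEGER PRIMARY KEY,
    user_id      INTEGER NOT NULL,
    order_amount INTEGER NOT NULL
);
-- The JOIN/SUM logic is written once and given a name.
CREATE VIEW user_order_totals AS
    SELECT user_id, SUM(order_amount) AS orders_amount_total
    FROM orders
    GROUP BY user_id;
""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 200), (2, 1, 150), (3, 2, 70), (4, 1, 320),
     (5, 2, 20), (6, 2, 10), (7, 2, 85), (8, 1, 25)],
)


def totals():
    """Read the per-user totals through the view."""
    return conn.execute(
        "SELECT user_id, orders_amount_total FROM user_order_totals ORDER BY user_id"
    ).fetchall()


before = totals()
conn.execute("INSERT INTO orders (user_id, order_amount) VALUES (1, 5)")
after = totals()  # the new order shows up without any counter being updated

print(before)  # [(1, 695), (2, 185)]
print(after)   # [(1, 700), (2, 185)]
```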
BTW re your schema design: don't name columns id; don't use the same column name in two different tables except if they mean the same thing.
Take this table as an example :
CREATE TABLE UserServices (
ID BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
Service1 TEXT,
Service2 TEXT,
.
.
.
) ENGINE = MYISAM;
Every user will have a different number of services, so let's say the table starts with 10 service columns for each user. If one user has 11 services, must all other users have 11 columns too? Of course it is a table, and every row needs to have the same number of columns, but it just seems like an awful waste of memory. Maybe the use of another database type is better?
Thank you!!
Storing a boatload of nulls isn't really a "waste of memory" because the space is negligible - hard disks cost pence per gigabyte, programmers cost tens/hundreds of $/hr so it's certainly economical to burn the space and it's not really a great argument for avoidance.
There is a better argument though, as others have said: databases don't do variable numbers of columns for a particular ID in a table, but they DO do variable numbers of rows per ID. This is how DBs are designed: columns are fixed, rows are variable. Everything that a database does and offers in terms of querying, storage, retrieval, internal design etc. is optimised towards this pattern.
There are well established operations (called pivots) that will turn your vertical arrangement of data into a horizontal one (with nulls) at query time, so you don't have to store the data horizontally.
Here's a pivot example:
Table:
ID, ServiceIdentifier, ServiceOwner
1, SV1, John
1, SV2, Sarah
2, SV1, Phil
2, SV2, John
2, SV3, Joe
3, SV2, Mark
SELECT
    ID,
    MAX(CASE WHEN ServiceIdentifier = 'SV1' THEN ServiceOwner END) AS SV1_Owner,
    MAX(CASE WHEN ServiceIdentifier = 'SV2' THEN ServiceOwner END) AS SV2_Owner,
    MAX(CASE WHEN ServiceIdentifier = 'SV3' THEN ServiceOwner END) AS SV3_Owner
FROM
    ServiceOwners -- placeholder name for the table above; TABLE itself is a reserved word
GROUP BY
    ID
Result:
ID  SV1_Owner  SV2_Owner  SV3_Owner
1   John       Sarah      NULL
2   Phil       John       Joe
3   NULL       Mark       NULL
As noted, it's not a huge cost to just store the data horizontally and if you're sure the table will never change/ not need new columns adding on a weekly basis to cope with new services etc, then it might be a sensible developer optimisation to just have columns full of nulls. If you'll add columns regularly, or one day have thousands of services, then vertical storage is going to have to be the way it goes
To expand a little on what's already been said:
Is there a way to add an attribute to only 1 row in SQL?
No, and that's kinda fundamental to how relational databases (SQL) work - and that's in any version of SQL, whether it's mysql, t-sql, etc. If you have a table and you want to add an attribute to that table, it's going to be another column, and that column will be there for every row. Not just relational databases - that's just how tables work.
But, that's not how anyone would do it. What you would do is what Alan suggested - a separate table for Services, then a 3rd table (he suggested naming it 'UserServices') that links the two. And that's not a one-off suggestion - that's pretty much "the" way to do it. There's no waste.
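To make the three-table layout concrete, here is a small sketch. The users, services and sample rows are invented for illustration, and it runs on SQLite through Python's sqlite3 module rather than MySQL just so it is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE services (
    service_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL UNIQUE
);
-- One row per (user, service) pair: a user with 11 services has 11 rows here,
-- and nobody else's rows change.
CREATE TABLE user_services (
    user_id    INTEGER NOT NULL REFERENCES users (user_id),
    service_id INTEGER NOT NULL REFERENCES services (service_id),
    PRIMARY KEY (user_id, service_id)
);
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "Alice"), (2, "Bob"), (3, "Carol")])
conn.executemany("INSERT INTO services VALUES (?, ?)",
                 [(1, "email"), (2, "backup"), (3, "vpn")])
conn.executemany("INSERT INTO user_services VALUES (?, ?)",
                 [(1, 1), (1, 2), (1, 3), (2, 2)])

# How many services does each user have? Users with none still appear, with 0.
counts = conn.execute("""
    SELECT u.name, COUNT(us.service_id) AS service_count
    FROM users AS u
    LEFT JOIN user_services AS us ON us.user_id = u.user_id
    GROUP BY u.user_id, u.name
    ORDER BY u.user_id
""").fetchall()

print(counts)  # [('Alice', 3), ('Bob', 1), ('Carol', 0)]
```

Nothing is stored for the services a user does not have, which is why there is no waste.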
Maybe the use of another database type is better?
Possibly. If you want something with fewer restrictions, you could go with something other than SQL. Since SQL is so dominant, everything else is usually categorized as NoSQL. Mongo is the most popular NoSQL database currently, which is why RC brought it up.
Let's say I want to create a table like this:
id | some_foreign_id | attribute | value
_________________________________________
1 1 Weight 100
2 1 Reps 5
3 2 Reps 40
4 3 Time 10
5 4 Weight 50
6 4 Reps 60
Versus the same data represented this way
id | some_foreign_id | weight | reps | time
____________________________________________
1 1 100 5 NULL
2 2 NULL 40 NULL
3 3 NULL NULL 10
4 4 50 60 NULL
And since in this case the id = foreign_id I think we can just append these columns to whatever table foreign_id is referring to.
I would assume most people would overwhelmingly say the latter approach is the accepted practice.
Is the former approach considered a bad idea, even though it doesn't result in any NULLs? What are the tradeoffs between these two approaches exactly? It seems like the former might be more versatile, at the expense of not really having a clear defined structure, but I don't know if this would actually result in other ramifications. I can imagine a situation where you have tons of columns in the latter example, most of which are NULL, and maybe only like three distinct values filled in.
EAV (Entity-Attribute-Value) is the model your first example uses. It has a few advantages, but you are on MySQL, and MySQL doesn't handle it the best. As pointed out in the thread "Crosstab View in mySQL?", MySQL lacks functions that other databases have. Postgres and other databases have some more fun functions (see "PostgreSQL Crosstab Query") that make this significantly easier. In the MSSQL world, this gets referred to as sparsely populated columns. I find columnar structures (Vertica, or high-end Oracle) actually lend themselves quite well to this.
Advantages:
Adding a new attribute is significantly easier than altering a table schema. If you are unsure what future column names will be, this is the way to go.
Sparsely populated columns result in tables full of nulls and redundant data, and EAV avoids that. You can set up logic to supply a 'default' value for an attribute, i.e. if no value is specified for this attribute, then use this value.
Downsides:
A bit harder to program with, in MySQL in particular, as per the comments above. Not all SQL devs are familiar with the model, and you might accidentally create a steeper learning curve for new team members.
Not the most scalable. Indexing is a challenge and you need workarounds. (Strawberry's input in the comments is along these lines: your value column is basically forced to VARCHAR, which does not index well or search easily... welcome to table scan hell.) You can get around this with a third table: say you query on dates like create date and close date a lot; create a third 'control' table that contains those frequently queried columns, index that, and refer to the EAV tables from there. Alternatively, create multiple EAV tables, one for each data type.
First one is the right one.
If you later want to change the number of properties, you don't have to change your DB structure.
Changing the DB structure can cause your app to break.
If the number of nulls is too big, you are wasting a lot of storage.
My take on this
I would probably use the first if I had a lot of different attributes and values that I wanted to add in a more dynamic way, like user tags or user-specific information.
I would probably use the second if I just had the three attributes from your example (weight, reps, time) and no need for anything dynamic or for any more attributes (and if I did need one, I would just add another column).
I would say both work; it is as you say yourself, "the former might be more versatile". Both ways need their own structure around them to extract, process and store data :)
Edit: for the first one to achieve the structure of the second one, you would have to add a join for each attribute you would want to include in the data extract.
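For what it's worth, here is roughly what "a join for each attribute" looks like on the question's sample data. The table name eav is made up, the value column is assumed to be an integer, and SQLite (via Python's sqlite3 module) stands in for MySQL so the sketch can be run directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE eav (
    id              INTEGER PRIMARY KEY,
    some_foreign_id INTEGER NOT NULL,
    attribute       TEXT NOT NULL,
    value           INTEGER NOT NULL
)
""")
conn.executemany(
    "INSERT INTO eav VALUES (?, ?, ?, ?)",
    [(1, 1, "Weight", 100), (2, 1, "Reps", 5), (3, 2, "Reps", 40),
     (4, 3, "Time", 10), (5, 4, "Weight", 50), (6, 4, "Reps", 60)],
)

# One LEFT JOIN per attribute turns the vertical rows back into columns.
# A missing attribute simply comes back as NULL (None in Python).
pivot = conn.execute("""
    SELECT ids.some_foreign_id,
           w.value AS weight,
           r.value AS reps,
           t.value AS time
    FROM (SELECT DISTINCT some_foreign_id FROM eav) AS ids
    LEFT JOIN eav AS w
           ON w.some_foreign_id = ids.some_foreign_id AND w.attribute = 'Weight'
    LEFT JOIN eav AS r
           ON r.some_foreign_id = ids.some_foreign_id AND r.attribute = 'Reps'
    LEFT JOIN eav AS t
           ON t.some_foreign_id = ids.some_foreign_id AND t.attribute = 'Time'
    ORDER BY ids.some_foreign_id
""").fetchall()

for row in pivot:
    print(row)
```

The result has the same shape as the second table in the question, NULLs included, and every new attribute you want in the extract costs one more join.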
I think the first way contributes better towards normalization. You could even create a new table with attributes:
id attribute
______________
1 reps
2 weight
3 time
And turn the second-to-last column into a foreign key that references it. This will save space and remove the risk of mistyping the attribute names. Like this:
id | some_foreign_id | attribute | value
_________________________________________
1 1 2 100
2 1 1 5
3 2 1 40
4 3 3 10
5 4 2 50
6 4 1 60
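A sketch of that lookup-table variant, to show the foreign key actually catching a bad attribute. The table names (attributes, eav) and the column name attribute_id are assumptions, and it runs on SQLite through Python's sqlite3 module only to stay self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
CREATE TABLE attributes (
    id        INTEGER PRIMARY KEY,
    attribute TEXT NOT NULL UNIQUE
);
CREATE TABLE eav (
    id              INTEGER PRIMARY KEY,
    some_foreign_id INTEGER NOT NULL,
    attribute_id    INTEGER NOT NULL REFERENCES attributes (id),
    value           INTEGER NOT NULL
);
""")
conn.executemany("INSERT INTO attributes VALUES (?, ?)",
                 [(1, "reps"), (2, "weight"), (3, "time")])
conn.executemany(
    "INSERT INTO eav VALUES (?, ?, ?, ?)",
    [(1, 1, 2, 100), (2, 1, 1, 5), (3, 2, 1, 40),
     (4, 3, 3, 10), (5, 4, 2, 50), (6, 4, 1, 60)],
)

# An attribute id that is not in the lookup table is refused outright,
# which is what protects against mistyped attribute names.
try:
    conn.execute(
        "INSERT INTO eav (some_foreign_id, attribute_id, value) VALUES (4, 9, 1)"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Joining back to the lookup table recovers the readable names.
named = conn.execute("""
    SELECT e.some_foreign_id, a.attribute, e.value
    FROM eav AS e
    JOIN attributes AS a ON a.id = e.attribute_id
    ORDER BY e.id
""").fetchall()
```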
As others have stated, the first way is the better way. Why? Well, it normalizes the structure. Reference: https://en.wikipedia.org/wiki/Database_normalization
As that article states, normalization reduces database size & allows for easy expansion.
I'm working on a project. It's mostly for learning purposes; I find that actually trying a complicated project is the best way to learn a language after grasping the basics. Database design is not my strong point. I've started reading up on it, but it's early days and I'm still learning.
Here is my alpha schema. I'm really at the point where I'm just trying to jot down everything I can think of and see if any issues jump out.
http://diagrams.seaquail.net/Diagram.aspx?ID=10094#
Some of my concerns i would like feedback on:
Notice the core attributes, like area for example. Let's say for simplicity the areas are kitchen, bedroom, garden, bathroom and living room. For another customer they might be homepage, contact page, about_us and splash screen. There could be 2 areas or there could be 100; there isn't a need to limit it.
I created separate tables for the defaults, and each is linked to a bug. Later I came to the problem of custom fields: if someone wants, for example, to mark which theme the bug applies to, we don't have that. There are probably 100 other things like it, so I wanted to stick to a core set of attributes and let the custom fields give people flexibility.
However, when I got to the custom fields I knew I had an issue. I can't be creating a table for every custom field, so I used 2 tables instead: custom_fields and custom_field_values. The idea is that every field, including the defaults, would be stored in the first table, and each would be linked to the values table, which would just have something like this:
custom_fields table
id project_id name
01 1 area(default)
12 2 rooms(custom)
13 4 website(custom)
custom_field_values table
id area project_id sort_number
667 area1 1 1
668 area2 1 2
669 area3 1 3
670 area4 1 4
671 bedroom 2 1
672 bathroom 2 2
673 garden 2 3
674 livingroom 2 4
675 homepage 4 1
676 about_us 4 2
677 contact 4 3
678 splash page 4 4
Does this look like an efficient way to handle dynamic fields like this, or are there other alternatives?
The defaults would be hard-coded, so you can either use them or replace them with your own, or I could create another table to allow users to edit the names of the defaults, which would be linked to their project. Any feedback is welcome, and if there is something obviously wrong with the schema please feel free to critique.
You have reinvented an old antipattern called Entity-Attribute-Value. The idea of custom fields in a table is really logically incompatible with a relational database. A relation has a fixed number of fields.
But even though it isn't properly relational, we still need to do it sometimes.
There are a few methods to mimic custom fields in SQL, though most of them break rules of normalization. For some examples, see:
Product table, many kinds of product, each product has many parameters on StackOverflow
My presentation Extensible Data Modeling with MySQL
My book SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming
I found this while searching for something similar, as the customer can submit custom fields for use later.
I settled on using the JSON data type, which I appreciate was not available when this question was asked.
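A rough sketch of the JSON approach, under a few assumptions: the bugs table and its custom_fields column are invented names, and because the example runs on SQLite (via Python's sqlite3 module) the document is stored as TEXT and decoded in Python. In MySQL 5.7 and later the column could be declared with the JSON type instead.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE bugs (
    id            INTEGER PRIMARY KEY,
    title         TEXT NOT NULL,
    custom_fields TEXT NOT NULL DEFAULT '{}'  -- JSON document, one key per custom field
)
""")

# Each bug carries whatever custom fields its project defined; no schema change needed.
conn.execute(
    "INSERT INTO bugs (title, custom_fields) VALUES (?, ?)",
    ("Door sticks", json.dumps({"area": "kitchen", "theme": "dark"})),
)
# A bug without custom fields falls back to the empty JSON object.
conn.execute("INSERT INTO bugs (title) VALUES (?)", ("No custom fields here",))

# Decode the stored JSON text back into Python dictionaries.
bugs = [
    (title, json.loads(raw))
    for title, raw in conn.execute(
        "SELECT title, custom_fields FROM bugs ORDER BY id"
    )
]

print(bugs)
```

The trade-off is that the database knows nothing about the keys inside the document, so validation of field names and values has to happen in the application.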
I'm creating a DB for my office. We have about 200 employees. Each employee was required to complete at least 1 of 12 courses within 2 years of being hired to become qualified (so there are different completion/qualification dates for every course; some people have been here 20 years, some just 1 year). Some have completed multiple courses. Each course has to be refreshed periodically (each refresh period is different and is based on the last refresher date). I'm having trouble with the layout of the table. Here's what I have as an idea, but I'm trying to see if there is a less busy way to lay out the data. I want to be able to run a query that tells me which person has completed which class (so it would have to look at all 3 class columns). I also want to be able to tell when their qualification has lapsed or is about to lapse. So far I've created an employee data table that looks like the table below.
ID Name Class1 Class2 Class3 QualDt-Cl1 QualDt-Cl2 QualDt-Cl3 LstRequal1 ...
1 Bob Art Spanish 3/17/1989 9/12/2010 3/8/2012
2 Sally Math 8/31/2012
3 George Physics History 2/6/2005 7/6/1996
4 Casey History 6/8/2000
5 Joe English Sports Physics 12/10/1993 10/15/2001 4/22/2006
The classes are listed in their own table and each class column pulls from that. The qual date refresher will be a calculated column in the query based on the last refresher date.
Is there a way to put all the classes one person is qualified for in one column and have the associated requalification date for each particular course in another column?
I think it would be less confusing if you had a table per subject and register the people's names under each one with the date passed.
Also, it would probably help to declutter the table of unnecessary info like the exact date the exam was passed. You could store month and year, or maybe just the year? If the leeway is 2 years that would probably make more sense, and it would also make the qualification calculation easier.
The query would work if you searched per subject, maybe? Or by who would qualify to do which subject this year and then the next.
This is not really the kind of question you would ask on here, by the way, but I hope the answer helps.
When designing a database, any time you find yourself adding columns with names like Class1, Class2, Class3 you should immediately stop and think about whether it makes more sense to put those columns in a separate child table called Classes with a link (relation) to the parent. There are several reasons for this, including:
What happens when somebody takes a fourth course? Saying "that will never happen" ignores the fact that "never is a very long time" and none of us can predict the future.
When checking whether or not someone has taken a course you really need to check (Class1 IS NOT NULL) OR (Class2 IS NOT NULL) OR (Class3 IS NOT NULL), and that can get really tedious. It also means that if you do have to add Class4 then all of that SQL code has to be corrected.
Similarly, if you want to find someone who took "CPR" you'd have to look for people with (Class1 = 'CPR') OR (Class2 = 'CPR') OR (Class3 = 'CPR'). Yuck.
So, save yourself some trouble (a lot of trouble, really) and create a Classes table:
ID
ClassName
QualDate
(etc. )
...where ID is the ID number from the main table (what is called a "foreign key"). From your sample data, your Classes table would look something like this:
ID ClassName QualDate
1 Art 3/17/1989
1 Spanish 9/12/2010
2 Math 8/31/2012
3 Physics 2/6/2005
3 History 7/6/1996
...
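As a quick sketch of how the child table pays off, here are the rows listed above loaded into SQLite through Python's sqlite3 module. The Employees table name, the ISO date format and the composite primary key are assumptions on my part; the original data lives in whatever database you are using.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employees (
    ID   INTEGER PRIMARY KEY,
    Name TEXT NOT NULL
);
CREATE TABLE Classes (
    ID        INTEGER NOT NULL REFERENCES Employees (ID),  -- foreign key to the main table
    ClassName TEXT NOT NULL,
    QualDate  TEXT NOT NULL,  -- ISO 8601 text (YYYY-MM-DD) so dates sort correctly
    PRIMARY KEY (ID, ClassName)
);
""")
conn.executemany("INSERT INTO Employees VALUES (?, ?)",
                 [(1, "Bob"), (2, "Sally"), (3, "George")])
conn.executemany(
    "INSERT INTO Classes VALUES (?, ?, ?)",
    [(1, "Art", "1989-03-17"), (1, "Spanish", "2010-09-12"),
     (2, "Math", "2012-08-31"),
     (3, "Physics", "2005-02-06"), (3, "History", "1996-07-06")],
)


def who_took(class_name):
    """Names of everyone who completed the given class: one condition, no ORs."""
    rows = conn.execute("""
        SELECT e.Name
        FROM Employees AS e
        JOIN Classes AS c ON c.ID = e.ID
        WHERE c.ClassName = ?
        ORDER BY e.Name
    """, (class_name,))
    return [name for (name,) in rows]


history = who_took("History")

# Number of completed classes per person; a fourth class is just another row.
per_person = conn.execute("""
    SELECT e.Name, COUNT(*) AS class_count
    FROM Employees AS e
    JOIN Classes AS c ON c.ID = e.ID
    GROUP BY e.ID, e.Name
    ORDER BY e.ID
""").fetchall()

print(history)     # ['George']
print(per_person)  # [('Bob', 2), ('Sally', 1), ('George', 2)]
```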