How do you know when you need separate tables? - mysql

How do you know when to create a new table for very similar object types?
Example:
To learn mysql I'm building a model solar system. For the purposes of my project, planets have many similar attributes to dwarf planets, centaurs, and comets. Dwarf planets are almost completely identical to planets. Centaurs and comets are only different from planets because their orbital path has more variation. Should I have a separate table for each type of object, or should they share tables?
The example is probably too simple, but I'm also interested in best practices. Like should I use separate tables just in case I want to make planets and dwarf planets different in the future, or are their any efficiency reasons for keeping them in the same table.

Normal forms is what you should be interested with. They pretty much are the convention for building tables.
Any design that doesn't break the first, second or third normal form is fine by me. That's a pretty long list of requirement though, so I suggest you go read it off the Wikipedia links above.

It depends on what type of information you want to store about the objects. If the information for all of them is the same, say orbit radius, mass and name, then you can use the same table. However, if there are different properties for each (say atmosphere composition for planets, etc.) then you can either use separate tables for each (not very normalized) or have one table for basic properties like orbit, mass and name and a second table for just the properties that are unique to planets (and a similar table for comets, etc. if needed). All objects would be in the first table but only planets would be in the second table and linked through a foreign key to the first table.

It's called Database Normalization
There are many normal forms. By applying normalization you will go through metadata (tables) and study the relationsships between data more clearly. By using the normalization techniques you will optimize the tables to prevent redundancy. This process will help you understand which entities to create based on the relationsships between the different fields.

You should most likely split the data about a planet etc so that the shared (common) information is in another table.
E.g.
Common (Table)
Diameter (Column)
Mass (Column)
Planet
Population
Comet
Speed
Poor columns I know. Have the Planet and Comet tables link to the Common data with a key.

This is definitely a subjective question. It sounds like you are already on the right lines of thinking. I would ask:
Do these objects share many attributes? If so, it's probably worth considering at the very least a base table to list them all in.
Does one object "extend" another - it has all the attributes of the other, plus some extras? If so, it might be worth adding another table with the extra attributes and a one-to-one mapping back to the base object.
Do both objects have many shared attributes and unshared attributes? If this is the case, maybe you need a single table plus a "data extension" system where each object can have a type or category that specifies any amount of extra attributes that may be associated with it.
Do the objects only share one or two attributes? In this case, they are probably dissimilar enough to separate into multiple tables.
You may also ask yourself how you are going to query the data. Will you ever want to get them all in the same list? It's always a good idea to combine data into tables with other data they will commonly be queried with. For example, an "attachments" table where the file can be an image or a video, instead of images and video tables, if you commonly want to query for all attachments. Don't split into multiple tables unless there is a really good reason.

If you will ever want to get planets and comets in one single query, they will pretty much have to be in the same table if you want the database to work efficiently. Inheritance should be handled inside your app itself :)

Here's my answer to a similar question, which I think applies here as well:
How do you store business activities in a SQL database?

There are many different ways to express inheritance in your relational model. For example you can try to squish everything in to one table and have a field that allows you to distinguish between the different types or have one table for the shared attributes with relationships to a child table with the specific attributes etc... in either choice you're still storing the same information. When going from a domain model to a relational model this is what is called an impedance mismatch. Both choices have different trade offs, for example one table will be easier to query, but multiple tables will have higher data density.
In my experience it's best not to try to answer these questions from a database perspective, but let your domain model, and sometimes your application framework of choice, drive the table structure. Of course this isn't always a viable choice, especially when performance is concerned.
I recommend you start by drawing on paper the relationships you want to express and then go from there. Does the table structure you've chosen represent the domain accurately? Is it possible to query to extract the information you want to report on? Are the queries you've written complicated or slow? Answering these questions and others like them will hopefully guide you towards creating a good relational model.
I'd also suggest reading up on database normalization if you're serious about learning good relational modeling principals.

I'd probably have a table called [HeavenlyBodies] or some such thing. Then have a look up table with the type of body, ie Planet, comet, asteroid, star, etc. All will share similar things such as name, size, weight. Most of the answers I read so far all have good advise. Normalization is good, but I feel you can take it too far sometimes. 3rd normal is a good goal.

Related

(Somewhat) complicated database structure vs. simple — with null fields

I'm currently choosing between two different database designs. One complicated which separates data better then the more simple one. The more complicated design will require more complex queries, while the simpler one will have a couple of null fields.
Consider the examples below:
Complicated:
Simpler:
The above examples are for separating regular users and Facebook users (they will access the same data, eventually, but login differently). On the first example, the data is clearly separated. The second example is way simplier, but will have at least one null field per row. facebookUserId will be null if it's a normal user, while username and password will be null if it's a Facebook-user.
My question is: what's prefered? Pros/cons? Which one is easiest to maintain over time?
First, what Kirk said. It's a good summary of the likely consequences of each alternative design. Second, it's worth knowing what others have done with the same problem.
The case you outline is known in ER modeling circles as "ER specialization". ER specialization is just different wording for the concept of subclasses. The diagrams you present are two different ways of implementing subclasses in SQL tables. The first goes under the name "Class Table Inheritance". The second goes under the name "Single Table Inheritance".
If you do go with Class table inheritance, you will want to apply yet another technique, that goes under the name "shared primary key". In this technique, the id fields of facebookusers and normalusers will be copies of the id field from users. This has several advantages. It enforces the one-to-one nature of the relationship. It saves an extra foreign key in the subclass tables. It automatically provides the index needed to make the joins run faster. And it allows a simple easy join to put specialized data and generalized data together.
You can look up "ER specialization", "single-table-inheritance", "class-table-inheritance", and "shared-primary-key" as tags here in SO. Or you can search for the same topics out on the web. The first thing you will learn is what Kirk has summarized so well. Beyond that, you'll learn how to use each of the techniques.
Great question.
This applies to any abstraction you might choose to implement, whether in code or database. Would you write a separate class for the Facebook user and the 'normal' user, or would you handle the two cases in a single class?
The first option is the more complicated. Why is it complicated? Because it's more extensible. You could easily include additional authentication methods (a table for Twitter IDs, for example), or extend the Facebook table to include... some other facebook specific information. You have extracted the information specific to each authentication method into its own table, allowing each to stand alone. This is great!
The trade off is that it will take more effort to query, it will take more effort to select and insert, and it's likely to be messier. You don't want a dozen tables for a dozen different authentication methods. And you don't really want two tables for two authentication methods unless you're getting some benefit from it. Are you going to need this flexibility? Authentication methods are all similar - they'll have a username and password. This abstraction lets you store more method-specific information, but does that information exist?
Second option is just the reverse the first. Easier, but how will you handle future authentication methods and what if you need to add some authentication method specific information?
Personally I'd try to evaluate how important this authentication component is to the system. Remember YAGNI - you aren't gonna need it - and don't overdesign. Unless you need that extensibility that the first option provides, go with the second. You can always extract it at a later date if necessary.
This depends on the database you are using. For example Postgres has table inheritance that would be great for your example, have a look here:
http://www.postgresql.org/docs/9.1/static/tutorial-inheritance.html
Now if you do not have table inheritance you could still create views to simplify your queries, so the "complicated" example is a viable choice here.
Now if you have infinite time than I would go for the first one (for this one simple example and prefered with table inheritance).
However, this is making things more complicated and so will cost you more time to implement and maintain. If you have many table hierarchies like this it can also have a performance impact (as you have to join many tables). I once developed a database schema that made excessive use of such hierarchies (conceptually). We finally decided to keep the hierarchies conceptually but flatten the hierarchies in the implementation as it had gotten so complex that is was not maintainable anymore.
When you flatten the hierarchy you might consider not using null values, as this can also prove to make things a lot harder (alternatively you can use a -1 or something).
Hope these thoughts help you!
Warning bells are ringing loudly with the presence of two the very similar tables facebookusers and normalusers. What if you get a 3rd type? Or a 10th? This is insane,
There should be one user table with an attribute column to show the type of user. A user is a user.
Keep the data model as simple as you possibly can. Don't build it too much kung fu via data structure. Leave that for the application, which is far easier to alter than altering a database!
Let me dare suggest a third. You could introduce 1 (or 2) tables that will cater for extensibility. I personally try to avoid designs that will introduce (read: pollute) an entity model with non-uniformly applicable columns. Have the third table (after the fashion of the EAV model) contain a many-to-one relationship with your users table to cater for multiple/variable user related field.
I'm not sure what your current/short term needs are, but re-engineering your app to cater for maybe, twitter or linkedIn users might be painful. If you can abstract the content of the facebookUserId column into an attribute table like so
user_attr{
id PK
user_id FK
login_id
}
Now, the above definition is ambiguous enough to handle your current needs. If done right, the EAV should look more like this :
user_attr{
id PK
user_id FK
login_id
login_id_type FK
login_id_status //simple boolean flag to set the validity of a given login
}
Where login_id_type will be a foreign key to an attribute table listing the various login types you currently support. This gives you and your users flexibility in that your users can have multiple logins using different external services without you having to change much of your existing system

DB table organization by entity, or vertically by level of data?

I hope the title is clear, please read further and I will explain what I mean.
We having a disagreement with our database designer about high level structure. We are designing a MySQL database and we have a trove of data that will become part of it. Conceptually, the data is complex - there are dozens of different types of entities (representing a variety of real-world entities, you could think of them as product developers, factories, products, inspections, certifications, etc.) each with associated characteristics and with relationships to each other.
I am not an experienced DB designer but everything I know tells me to start by thinking of each of these entities as a table (with associated fields representing characteristics and data populating them), to be connected as appropriate given the underlying relationships. Every example of DB design I have seen does this.
However, the data is currently in a totally different form. There are four tables, each representing a level of data. A top level table lists the 39 entity types and has a long alphanumeric string tying it to the other three tables, which represent all the entities (in one table), entity characteristics (in one table) and values of all the characteristics in the DB (in one table with tens of millions of records.) This works - we have a basic view in php which lets you navigate among the levels and view the data, etc. - but it's non-intuitive, to say the least. The reason given for having it this way is that it makes the size of the DB smaller, shortens query time and makes expansion easier. But it's not clear to me that the size of the DB means we should optimize this over, say, clarity of organization.
So the question is: is there ever a reason to structure a DB this way, and what is it? I find it difficult to get a handle on the underlying data - you can't, for example, run through a table in traditional rows-and-columns format - and it hides connections. But a more "traditional" structure with tables based on entities would result in many more tables, definitely more than 50 after normalization. Which approach seems better?
Many thanks.
OK, I will go ahead and answer my own question based on comments I got and more research they led me to. The immediate answer is yes, there can be a reason to structure a DB with very few tables and with all the data in one of them, it's an Entity-Attribute-Value database (EAV). These are characterized by:
A very unstructured approach, each fact or data point is just dumped into a big table with the characteristics necessary to understand it. This makes it easy to add more data, but it can be slow and/or difficult to get it out. An EAV is optimized for adding data and for organizational flexibility, and the payment is it's slower to access and harder to write queries, etc.
A "long and skinny" format, lots of rows, very few columns.
Because the data is "self encoded“ with its own characteristics, it is often used in situations when you know there will be lots of possible characteristics or data points but that most of them will be empty ("sparse data"). A table approach would have lots of empty cells, but an EAV doesn't really have cells, just data points.
In our particular case, we don't have sparse data. But we do have a situation where flexibility in adding data could be important. On the other hand, while I don't think that speed of access will be that important for us because this won't be a heavy-access site, I would worry about the ease of creating queries and forms. And most importantly I think this structure would be hard for us BD noobs to understand and control, so I am leaning towards the traditional model - sacrificing flexibility and maybe ease of adding new data in favor of clarity. Also, people seem to agree that large numbers of tables are OK as long as they are really called for by the data relationships. So, decision made.

SQL one-to-one relationships vs flattening

I'm using a standard SQL database and I'm trying to figure out whether or not to flatten a table or make it more "object-oriented". To me, smaller tables are easier to read but it would require joining tables and having one-to-one relationships. Is this generally a good way of doing things or is it frowned on in the SQL world?
I have a table which has the following attributes:
MYTABLE
- ID
- NAME
- LABEL
- CREATED_TS
- MODIFIED_TS
- CREATED_USER
- MODIFIED_USER
To me, the created/modified fields would be their own object. There are actually a few more fields as well so it's not really just this small. I would think that creating another table called "MYTABLE_MODINFO" or something like that which would have the CREATED and MODIFIED fields and they would be joined when data from them was needed. These tables aren't high access tables, they wouldn't have tons of queries per minute or even hundreds of rows in them, so I don't think efficiency would be much of an issue.
So mainly what I'm wondering is would this be a generally accepted design or should you generally keep your table structures flat?
You should create audit information in the same table. The reason is that this data is part of the row and is a one to one relationship, so there is no point in branching it apart.
If you want to store the audit info (audit tracking/history), then you can create another table, however in most cases I have seen this built by "duplicating" data and creating a surrogate key and mappings back to the original row. The reason I list duplicating in quotes is because auditing inherently requires duplication of the old data...if it is linked and changeable after being written, then it is not really an audit.
Just my two cents. If it does not make sense, then I can provide some examples. But, the gist is that each row will only ever have one current piece of modification information, so why break it out if it will never have more than one?
avoid a database 'one to one', you'll lose performance, scalability, independence. can you imagine what happen if you want to store 2 pictures per ID? will you create another field or will you repeat the row??... it's easier to create relationship to have more freedom when you want to upgrade, please review this tutorials.
http://www.youtube.com/watch?v=Onzm-PxSjtE
http://folkworm.ceri.memphis.edu/ew/SCHEMA_DOC/comparison/erd.htm
http://www.visual-paradigm.com/product/vpuml/provides/dbmodeling.jsp
Beside that you should normalize the DB to be sure that everything is in the best shape possible. Remember that the most important is to take what you need and adapt it.
http://databases.about.com/od/specificproducts/a/normalization.htm
http://www.youtube.com/watch?v=xzeuBwHkKxw
RDBMS design aren't the same with object-oriented approach in my view. the example you mentioned aren't different objects domain but data inheritance of your record. Since there would not be any overhead of tons of queries/execution of the table so you should keep them in the same table for auditing purpose and also easier to work with at normalize data.

'Many to two' relationship

I am wondering about a 'many to two' relationship. The child can be linked to either of two parents, but not both. Is there any way to reinforce this? Also I would like to prevent duplicate entries in the child.
A real world example would be phone numbers, users and companies. A company can have many phone numbers, a user can have many phone numbers, but ideally the user shouldn't provide the same phone number as the company as there would be duplicate content in the DB.
This question shows that you don't fully understand entity relationships (no rudeness intended). Of which there are four (technically only 3) types below:
One to One
One to Many
Many to One
Many to Many
One to One (1:1):
In this case a table has been broken up into two parts for purposes of complying with normalisation, or more usually the open closed principle.
Normalisation compliance: You might have a business rule that each customer has only one account. Technically, you could in this case say customer and account could all be in the same table, but this breaks the rules of normalisation, so you split them and make a 1:1.
Open-Close principle compliance: A customer table, might have id, first & last names, and address. Later someone decides to add a date of birth and with it the ability to calculate age along with a bunch of other much needed fields. This is an over simplified example of one to one, but you get the main use for it is to extend your database without breaking existing code. Much code written (sadly) is tightly coupled to the database so changes in the structure of a table will break the code. Adding a 1:1 like this will extend the table to meet new requirements without modifying the origional, thereby allowing old code to continue functioning normally and new code to make use of the new db features.
The downside of normalisation and extending tables using 1:1 relationships in this way is performance. Often times on heavly used systems, the first target to increase database performance is de-normalising and combining such tables into a single table, and optimising the indexes thus removing the need to use joins and read from multiple tables. Normalisation / De-Normalisation is neither a good or bad thing, as it depends on the needs of the system. Most systems usually start off normalised changing back when needed, but this change needs to be done very carefully as mentioned, if code is tightly coupled to the DB structure, it will almost definitely cause the system to fail. i.e. When you combine 2 tables, one ceases to exist, all the code that includes that now nonexistant table fails until it is modified (in db terms, imagine connecting relationships to any of the tables in the 1:1, when you remove those tables, this breaks the relationships, and so the structure has to be greatly modified to compensate. Unfortunately, such bad designs are much easier to spot in the DB world than in the software world in most cases and you don't usually notice something went wrong in code until it all falls apart) unless the system is properly designed with separation of concerns in mind.
It the closest thing you can get to inheritance in object oriented programming. But its not quite the same.
One to Many (1:M) / Many to One (M:1):
These two relationships (hense why 4 become 3), are the most popular relationship types. They are both the same type of relationship, the only thing that changes is your point of view. An example A customer has many phone numbers, or alternately, many phone numbers can belong to a customer.
In object oriented programming this would be considered composition. Its not inheritance, but you are saying one item is composed of many parts. This is usually represented with arrays / lists / collections etc. inside of classes as opposed to an inheritance structure.
Many to Many (M:M):
This type of relationship with current technology is impossible. For this reason we need to break it down into two one to many relationships with an "association" table joining them. The many side of the two one to many relationships is always on the association / link table.
For your example, the person who said you need a many to many is correct. Because a two to many is effectively a many (meaning more than one) to many relationship. This is the only way you would get your system to work. Unless you are intending to research the field of relational calculus to find some new type of relationship that would allow this.
Also for such relationships (m2m) you have two choices, either create a compound key in the linker table so the combination of fields become a unique entry (if you are interested in db optimisation this is the slower choice, but takes less space). Alternately, you create a third field with an auto generated id column and make that the primary key (for db optimisation, this is the faster choice, but takes more space).
In your example specifically above...
A real world example would be phone numbers, users and companies. A company can have many phone numbers, a user can have many phone numbers, but ideally the user shouldn't provide the same phone number as the company as there would be duplicate content in the DB.
This would be a many to many relationship with the phone number table as the linker table between companies and users. As explained, to ensure no phone number is repeated, you simply set it as the primary key or use another primary key and set the phone number field to unique.
For those kind of questions, it is really down to how you phrase them. What is causing you to get confused about this, and how you overcome this confusion to see the solution is simple. Rephrase the problem as follows. Start by asking is it a one to one, if the answer is no, move on. Next ask is it a one to many, if the answer is no move on. The only other option remaining is many to many. Be careful though, ensure you have considered the first 2 questions carefully before moving on. Many inexperienced database people often over complicate issues by defining one to many as many to many. Once again, the most popular type of relationship by far is one to many (I would say 90%) with the many to many and one to one spliting the remaining 10% 7/3 respectevely. But those figures are just my personal perspective, so dont go quoting them as industry standard statistics. My point is to make extra extra sure it is definitely not a one to many before choosing many to many. It is worth the extra effort.
So now to find the linker table between the two, decide which two are your main tables, and what fields need to be shared between them. In this case, company and user tables both need to share the phone. Hense you need to make a new phone table as the linker.
The warning alarm of misunderstanding should show as soon as you decide none of the 3 are working for you. This should be enough to tell you that you simply are not phrasing the relationship question correctly. You will get better at it as time passes, but it is an essential skill and really should be mastered as soon as possible for your own sanaty.
Of course you could also go to an object oriented database which will allow a range of other relationships called "Hierarchacal" relationships. Thats great if you are thinking of becomming a programmer too. But I wouldnt recommend this as it going to make your head hurt when you start finding ways to combine the various types of relationships. Especially given there is not much need since nearly all databases in the world consist of just those 3 types of relationships unless they are something super duper special.
Hope this was a reasonable answer. Thanks for taking the time to read it.
Just make phone number a key in your contact numbers table.
For your phone number example, you would put the phone number in a table by itself, with an ID.
Then you link to that phone_id from each of users and companies.
For your parents example, you don't link the child to parent - instead you link the parent to the child. OR, you put both parents in the same table, and the child just links to one of them.

Database schema - organise by object or data?

I'm refactoring a horribly interwoven db schema, it's not that it's overly normalised; just grown ugly over time and not terribly well laid out.
There are several tables (forum boards, forum posts, idea posts, blog entries) that share virtually identical data structures and composition, but are seperated simply because they represent different "objects" from the applications perspective. My initial reaction is to put everything that has the same data structure into the same table, and use a "type" column to distinguish data when performing a select.
Am I setting myself up for a fall by adopting this "all into one" approach and allowing (potentially) so many parts of the application to access the same table? FYI, I can't see this database growing to more than ~20mb over the next year or so...
There's basically three ways to store an object inheritance hierarchy in a relational database. Each has their own pros and cons. See:
http://www.martinfowler.com/eaaCatalog/singleTableInheritance.html
http://www.martinfowler.com/eaaCatalog/classTableInheritance.html
http://www.martinfowler.com/eaaCatalog/concreteTableInheritance.html
The book is great too. Luck would have it that chapter 3 - "Mapping to Relational Databases" - is available freely as a sample chapter. You can read more about the tradeoffs in there.
I used to dislike this "all into one" approach, but after I was forced to use it on a complex project a few years ago, I became a fan. If you index the table correctly, performance should be OK. You'll want an index on the type column to speed up your sort by type operations, for instance.
I now usually recommend that you use a single table to store similar objects. The only question then, is, do you want to use subtables to store data that's specific to a certain type of object? The answer to this question really depends on how different the structure of each object type is, and how many object types you'll have. If you have 50 object types with vastly differing structures, you may want to consider storing just the consistent object parts in the main table and creating a sub table for each object type.
In your example, however, I think you'd be fine just putting it all into a single table.
For more info, see here: http://www.agiledata.org/essays/mappingObjects.html
Don't lean too much on the "applications perspective", it tends to vary over time anway. Often databases are accessed by different applications too, and it usually outlives them all ...
When simliar objects are stored in different tables the reason may be that they actually represent the same domain object, but in a different state, or in a different step in a workflow. Then it often makes sense to keep them in one table and add some simple attributes to flag the state. If the workflow, or whatever it is changes, it's easier to change the database and application too, you may not need to add more tables or classes.