Many null values in one table vs three tables - mysql

I have three tables with common fields - users, guests and admins.
The last two tables have some of the users fields.
Here's an example:
users
id|username|password|email|city|country|phone|birthday|status
guests
id|city|country|phone|birthday
admins
id|username|password|status
I'm wondering if it's better to:
a) use one table with many NULL values
b) use three tables

The question is less about "one table with many NULLs versus three tables" than about the data structure. The real question is how other tables in your data structure will refer to these entities.
This is a classic situation, where you have "one-of" relationships and need to represent them in SQL. There is a "right" way, and that is to have four tables:
"users" (I can't think of a good name) would encompass everyone and have a unique id that could be referenced by other tables
"normal", "admins", "guests" each of which would have a 1-0/1 relationship with "users"
This allows other tables to refer to any of the three types of users, or to users in general. This is important for maintaining proper relationships.
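The four-table layout described above could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL, the column lists are taken from the question, and the choice to reuse the users id as each subtype's primary key (which gives the 1-0/1 relationship) is one possible way to do it.

```python
import sqlite3

# SQLite stands in for MySQL here; names and column types are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE users (            -- supertype: one row per person of any kind
    id INTEGER PRIMARY KEY
);
CREATE TABLE admins (           -- subtype: at most one row per users.id
    user_id  INTEGER PRIMARY KEY REFERENCES users(id),
    username TEXT NOT NULL,
    password TEXT NOT NULL,
    status   TEXT
);
CREATE TABLE guests (
    user_id  INTEGER PRIMARY KEY REFERENCES users(id),
    city TEXT, country TEXT, phone TEXT, birthday TEXT
);
CREATE TABLE normal (
    user_id  INTEGER PRIMARY KEY REFERENCES users(id),
    username TEXT NOT NULL, password TEXT NOT NULL, email TEXT,
    city TEXT, country TEXT, phone TEXT, birthday TEXT, status TEXT
);
""")
conn.execute("INSERT INTO users (id) VALUES (1), (2)")
conn.execute("INSERT INTO admins (user_id, username, password) VALUES (1, 'root', 'x')")
conn.execute("INSERT INTO guests (user_id, city) VALUES (2, 'Oslo')")

# A subtype row without a parent users row is rejected by the foreign key.
try:
    conn.execute("INSERT INTO guests (user_id) VALUES (99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

admin_ids = [row[0] for row in conn.execute("SELECT user_id FROM admins")]
```

Other tables can then reference users(id) when any kind of user will do, or admins(user_id) when only an admin makes sense.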
You have suggested two shortcuts. One is that there is no information about "normal" users so you dispense with that table. However, this means that you can't refer to "normal" users in another table.
Often, when the data structures are similar, the data is simply denormalized into a single row (as in your solution a).
All three approaches are reasonable in the context of applications that have specific needs. As for performance, the cost of additional NULLable columns is generally minimal when the data types are variable length. If a lot of the additional columns are numeric, they can occupy real space even when NULL (depending on the storage engine and row format), which can be a factor in designing the best solution.
In short, I wouldn't choose between the different options based on the premature optimization of which might be better. I would choose between them based on the overall data structure needed for the database, and in particular, the relationships that these entities have with other entities.
EDIT:
Then there is the question of the id that you use for the specialized tables. There are two ways of doing this. One is to have a separate id, such as AdminId and GuestId for each of these tables. Another column in each table would be the UserId.
This makes sense when other entities have relationships with these particular entities. For instance, "admins" might have a sub-system that describes rights and roles and privileges that they have, perhaps along with a history of changes. These tables (ahem, entities) would want to refer to an AdminId. And, you should probably oblige by letting them.
If you don't have such tables, then you might still split out the Admins, because the 100 integer columns they need are a waste of space for the zillion other users. In that case, you can get by without a separate id.
I want to emphasize that you have asked a question that doesn't have a "best" answer in general. It does have a "correct" answer by the rules of normalization (that would be 4 tables with 4 separate ids). But the best answer in a given situation depends on the overall data model.

Why not have one parent user table with three foreign-keyed detail tables? That gives each user a unique id that can transition between types.

I generally agree with Chriseyre2000, but in your specific example, I don't see a need for the other 2 tables. Everything is contained in users, so why not just add Guest and Admin bit fields? Or even a single UserType field.
Though Chriseyre2000's solution will give you better scalability should you later want to add fields that are specific to guests and admins.
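For comparison, the single-table variant with a UserType field might look something like this. It is a minimal sketch using Python's sqlite3 module in place of MySQL; the user_type values and the CHECK constraint are assumptions for illustration (in MySQL an ENUM column would be a common alternative).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE users (
    id        INTEGER PRIMARY KEY,
    user_type TEXT NOT NULL CHECK (user_type IN ('user', 'guest', 'admin')),
    username  TEXT,   -- NULL for guests
    password  TEXT,   -- NULL for guests
    email     TEXT,
    city TEXT, country TEXT, phone TEXT, birthday TEXT,  -- NULL for admins
    status    TEXT
)
""")
conn.executemany(
    "INSERT INTO users (user_type, username, city) VALUES (?, ?, ?)",
    [("admin", "root", None), ("guest", None, "Oslo"), ("user", "ann", "Rome")],
)

# One table means one query covers every kind of user.
counts = dict(conn.execute(
    "SELECT user_type, COUNT(*) FROM users GROUP BY user_type"))
```

The price is that the database itself no longer knows which columns are required for which type; that rule has to live in the application or in extra constraints.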

Related

Generic votes table vs separate votes tables?

I want to implement a vote system for several different entities/tables (e.g. articles, blog posts, users).
What is the best/more efficient approach?:
Create a table votes to store all the votes of all entities?
votes
vote_id
user_id
type (articles, blogposts or users)
Create a table votes for each entity? votes_articles, votes_blogposts, votes_users
What I see is:
The first option will result in a bigger table, and there's an additional field which I need to include in my queries. It is a more generic table that can be easily extended for more entities if needed, and everything is kind of centralised. (I can use a generic function to retrieve/insert/update the table.)
The second option will result in smaller tables; faster to query? But not necessarily better to maintain.
The second method has many advantages. Presumably, the votes are actually on entities, so you also have an id in each table pointing to the article, blogpost, or whatever that is being voted on. In a standard SQL database, you would like to have foreign key references to other tables, and the one-table-per-entity approach provides that capability.
You could modify the first approach to do this. However, that would require a separate column for each possible entity. And, then, you lose the easy flexibility of adding new entities.
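A rough sketch of that modified first approach, with one nullable foreign key column per entity, is below. It uses Python's sqlite3 module instead of MySQL and shows only two of the three entities; the CHECK constraint that forces exactly one entity column to be set is an assumption added for illustration, not something the approach requires.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE articles  (id INTEGER PRIMARY KEY);
CREATE TABLE blogposts (id INTEGER PRIMARY KEY);
CREATE TABLE votes (
    vote_id     INTEGER PRIMARY KEY,
    user_id     INTEGER NOT NULL,
    article_id  INTEGER REFERENCES articles(id),
    blogpost_id INTEGER REFERENCES blogposts(id),
    -- exactly one of the entity columns must be set
    CHECK ((article_id IS NOT NULL) + (blogpost_id IS NOT NULL) = 1)
);
INSERT INTO articles (id) VALUES (10);
INSERT INTO blogposts (id) VALUES (20);
INSERT INTO votes (user_id, article_id) VALUES (1, 10);
""")

def try_insert(sql):
    """Return True if the insert succeeds, False if a constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

both_set = try_insert(
    "INSERT INTO votes (user_id, article_id, blogpost_id) VALUES (1, 10, 20)")
dangling = try_insert(
    "INSERT INTO votes (user_id, article_id) VALUES (1, 999)")
```

Every new votable entity means another column and a change to the CHECK expression, which is the loss of flexibility mentioned above.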
When is the first approach advantageous? First, when maintaining valid foreign key references is not important. Second, when you often want to bring together votes as votes. So, how many times did a user vote today regardless of what s/he voted on? How many votes do user A and user B have in common regardless of what they voted on? You get the idea. If votes starts to behave like its own entity, then it deserves its own table.
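The "votes as votes" queries are straightforward against a single generic table. The sketch below uses Python's sqlite3 module rather than MySQL; the entity_id and voted_on columns are assumptions, since the question's table lists only vote_id, user_id and type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE votes (
    vote_id   INTEGER PRIMARY KEY,
    user_id   INTEGER NOT NULL,
    type      TEXT NOT NULL,     -- 'articles', 'blogposts' or 'users'
    entity_id INTEGER NOT NULL,  -- assumed column: id of the row voted on
    voted_on  TEXT NOT NULL      -- assumed column: ISO date of the vote
);
INSERT INTO votes (user_id, type, entity_id, voted_on) VALUES
    (1, 'articles',  10, '2024-01-05'),
    (1, 'blogposts', 20, '2024-01-05'),
    (2, 'articles',  10, '2024-01-05'),
    (2, 'users',      7, '2024-01-06');
""")

# How many times did user 1 vote on a given day, regardless of entity?
(votes_by_user_1,) = conn.execute(
    "SELECT COUNT(*) FROM votes WHERE user_id = ? AND voted_on = ?",
    (1, "2024-01-05"),
).fetchone()

# How many votes do users 1 and 2 have in common, regardless of entity?
(common_votes,) = conn.execute("""
    SELECT COUNT(*)
    FROM votes a
    JOIN votes b ON a.type = b.type AND a.entity_id = b.entity_id
    WHERE a.user_id = 1 AND b.user_id = 2
""").fetchone()
```

With one table per entity, both queries would need a UNION over every votes_* table.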
I happen to think that your very question highlights a major weakness in SQL and relational databases. This is an example of wanting different entities to "inherit" features from a class (to borrow terminology from the OO world). Wouldn't it be nice if you could just specify that a new entity inherits properties from another entity (such as "Votable")? Oh, never mind, that's not the real world of popular databases. At least not today.
EDIT:
If you care about performance, don't go with the modified first approach -- that is, a separate column for each possible entity. Normally, primary keys are 4-byte integers. These (in most databases at least) will occupy four bytes, regardless of whether the column has a NULL value. So, one table with three entity columns is (to a very rough approximation) three times the size of three tables specialized for each entity. Such wasted space only slows down the query processing.
If you are only going to have two or three entities, maybe this isn't that big a deal. But once you get to more than you can count on one hand, it really is a waste of space, memory, and processing power.

Enhancing the Database Design

My CMS application, which lets users post Classifieds, Articles, Events, Directories, Properties etc., has its database designed as follows:
1st Approach:
Each Section (i.e 'classifieds','events' etc) has three tables dedicated to store data relevant to it:
Classified:
classified-post
classified-category
classified-post-category
Event:
events_post.
events_category.
events_post-category.
The same applies for Articles, Properties, Directories etc. each Section has three tables dedicated to its posts, categories.
The problem with this approach is:
Too many database tables (which leads to an increasing number of model and controller files).
Two foreign keys are needed to avoid duplicate entries in associative tables.
For example: let's say the comments, ratings and images tables belong to classified-post, events-posts etc., so the structure of the tables would be:
Image [id, post_id, section]
The second FK section must be stored and associated to avoid duplicate posts.
2nd Approach:
This approach will have a single posts table with a section column associated with each post as a foreign key, i.e.
post: id, section, title etc ....VALUES ( 1, 'classifieds','abc') (2,'events','asd')
While the second approach is a little bit cumbersome when writing SQL queries, it eases the process when querying related tables, e.g. the images, ratings and comments tables all belong to the posts table.
image [ id, post_id (FK) ]
While this approach seems clean and easy, this will end up in having oodles of columns in posts table, that it will have columns related to events, classifieds, directories etc which will lead to performance issues while querying for rows and columns.
The same applies for categories. It could be either one of the two approach, either save section column as second foreign key or have separate tables for each sections ( 1st approach ).
So now my question is: which approach is considered better? Does either of the two approaches have a performance benefit over the other? Or what is the best approach when dealing with these paradigms?
I would favor the second approach, with some considerations.
A standard piece of database design guidance is that the designer should first create a fully normalized design; selective denormalization can then be performed for performance reasons.
Normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency.
Denormalization is the process of attempting to optimize the read performance of a database by adding redundant data or by grouping data.
Hint: Programmers building their first database are often primarily concerned with performance. There’s no question that performance is important. A bad design can easily result in database operations that take ten to a hundred times as much time as they should.
A very likely example could be seen here
A draft model following the mentioned approach could be:
Approach 1 has the problem of too many tables
Approach 2 has too many columns
Consider storing your data in a single table as in Approach 2, but storing all the optional foreign key data in XML.
The XML field will only have data that it needs for a particular section. If a new section is added, then you just add that kind of data to the XML
Your table may look like
UserID int FK
ImageID int FK
ArtifactCategory int FK
PostID int FK
ClassifiedID int FK
...
Other shared
...
Secondary xml
Now you have neither too many columns nor too many tables
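A minimal sketch of the XML-column idea follows, using Python's sqlite3 and xml.etree.ElementTree modules in place of MySQL. The posts table, the secondary column contents and the element names (venue, starts, price) are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE posts (
    id        INTEGER PRIMARY KEY,
    section   TEXT NOT NULL,
    title     TEXT NOT NULL,
    secondary TEXT            -- XML document with section-specific fields
)
""")
conn.executemany(
    "INSERT INTO posts (section, title, secondary) VALUES (?, ?, ?)",
    [
        ("events", "Launch party",
         "<event><venue>Town hall</venue><starts>2024-06-01</starts></event>"),
        ("classifieds", "Bike for sale",
         "<classified><price>120</price></classified>"),
    ],
)

def secondary_fields(post_id):
    """Parse the XML column of one post into a plain dict."""
    (xml_text,) = conn.execute(
        "SELECT secondary FROM posts WHERE id = ?", (post_id,)).fetchone()
    return {child.tag: child.text for child in ET.fromstring(xml_text)}

event_fields = secondary_fields(1)
```

The trade-off is that values stored inside the XML cannot carry ordinary foreign key constraints and are harder to filter on than regular columns.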

Mysql: separate or common relationship tables for different entities

In my database I have different entities like todos, events, discussions, etc. Each of them can have tags, comments, files, and other related items.
Now I have to design the relationships between these tables, and I think I have to choose from the following two possible solutions:
1. Separated relationship tables
So I will create todos_tags, events_tags, discussions_tags, todos_comments, events_comments, discussions_comments, etc. tables.
2. Common relationship tables
I will create only these tables: related_tags, related_comments, related_files, etc. having a structure like this:
related_tags
entity (event|discussion|todo|etc. - as enum or tinyint (1|2|3|etc.))
entity_id
tag_id
Which design should I use?
Probably you will say: it depends on the situation, and I think this is correct.
In my case most of the time (maybe 70%+) I will have to query only one of the entities (events, discussions or todos), but in some cases I need them all in the same query (events, discussions and todos all having a specified tag, for example). In this case I'll have to do a union on 3+ tables (in my case it can be 5+ tables) if I go with separated relationship tables.
I'll not have more than 1000-2000 rows in each table(events, discussions, todos);
What is the correct way to go? What are some personal experiences about this?
The second schema is more extensible. This way you will be able to extend your application to construct queries involving more than one type. In addition, it's possible to easily add new types in the future, even dynamically. Furthermore, it allows greater aggregation freedom, for example allowing you to count how many rows of each type exist, or how many were created during a particular timeframe.
On the other hand, the first design does not really have many advantages other than speed, but MySQL is already good at handling these types of queries fast enough for you. You can create an index on "entity" to make it work smoothly. If in the future you need to partition your tables to increase speed, you can do so at a later stage.
It is a far simpler design to have a single, common relationship table such as related_tags where you specify the entity type in a column rather than having multiple tables. Just be sure you properly index the entity and tag_id fields together to have optimum performance.
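The common relationship table with a composite index could look roughly like this. The sketch uses Python's sqlite3 module as a stand-in for MySQL; the index name and the column order in the index are assumptions and worth checking against your own query patterns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE related_tags (
    entity    TEXT    NOT NULL,  -- 'event', 'discussion', 'todo', ...
    entity_id INTEGER NOT NULL,
    tag_id    INTEGER NOT NULL,
    PRIMARY KEY (entity, entity_id, tag_id)
);
-- Composite index for "everything with this tag, optionally of one type".
CREATE INDEX idx_related_tags_tag ON related_tags (tag_id, entity);
INSERT INTO related_tags (entity, entity_id, tag_id) VALUES
    ('event', 1, 5), ('todo', 1, 5), ('todo', 2, 5), ('discussion', 3, 9);
""")

# All entities carrying tag 5, no UNION needed.
tagged = conn.execute(
    "SELECT entity, entity_id FROM related_tags WHERE tag_id = ? "
    "ORDER BY entity, entity_id", (5,)).fetchall()

# Only todos carrying tag 5 (the common single-entity case).
todo_ids = [r[0] for r in conn.execute(
    "SELECT entity_id FROM related_tags WHERE tag_id = ? AND entity = ? "
    "ORDER BY entity_id", (5, "todo"))]
```

Note that entity_id cannot be declared as a foreign key here, because it points at a different table depending on the value of entity.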

For storing people in MySQL (or any DB) - multiple tables or just one?

Our company has many different entities, but a good chunk of those database entities are people. So we have customers, and employees, and potential clients, and contractors, and providers and all of them have certain attributes in common, namely names and contact phone numbers.
I may have gone overboard with object-oriented thinking but now I am looking at making one "Person" table that contains all of the people, with flags/subtables "extending" that model and adding role-based attributes to junction tables as necessary. If we grow to say 250.000 people (on MySQL and ISAM) will this so greatly impact performance that future DBAs will curse me forever? Our single most common search is on name/surname combinations.
For, e.g. a company like Salesforce, are Clients/Leads/Employees all in a centralised table with sub-views (for want of a better term) or are they separated into different tables?
Caveat: this question is to do with "we found it better to do this in the real world" as opposed to theoretical design. I like the above solution, and am confident that with views, proper sizing and accurate indexing, that performance won't suffer. I also feel that the above doesn't count as a MUCK, just a pretty big table.
One 'person' table is the most flexible, efficient, and trouble-free approach.
It will be easy for you to do limited searches - find all people with this last name and who are customers, for example. But you may also find you have to look up someone when you don't know what they are - that will be easiest when you have one 'person' table.
However, you must consider the possibility that one person is multiple things to you: a customer because they bought something and a contractor because you hired them for a job. It would be better, therefore, to have a 'join' table that gives you a many-to-many relationship.
create table person_type (
person_id int unsigned,
person_type_id int unsigned,
date_started datetime,
date_ended datetime,
[ ... ]
)
(You'll want to add indexes and foreign keys, of course. person_id is a FK to 'person' table; 'person_type_id' is a FK to your reference table for all possible person types. I've added two date fields so you can establish when someone was what to you.)
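Filled out with keys and sample data, the join table might be used as sketched below. This runs on Python's sqlite3 module rather than MySQL; the person_type_ref table name, its label column and the composite primary key are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL, surname TEXT NOT NULL
);
CREATE TABLE person_type_ref (      -- reference table of possible roles
    person_type_id INTEGER PRIMARY KEY,
    label TEXT NOT NULL UNIQUE
);
CREATE TABLE person_type (
    person_id      INTEGER NOT NULL REFERENCES person(person_id),
    person_type_id INTEGER NOT NULL REFERENCES person_type_ref(person_type_id),
    date_started   TEXT NOT NULL,
    date_ended     TEXT,            -- NULL while the role is still active
    PRIMARY KEY (person_id, person_type_id, date_started)
);
INSERT INTO person VALUES (1, 'Ada', 'Smith');
INSERT INTO person_type_ref VALUES (1, 'customer'), (2, 'contractor');
INSERT INTO person_type VALUES
    (1, 1, '2023-01-10', NULL),
    (1, 2, '2023-03-01', '2023-09-30');
""")

# Every role one person has ever held.
all_roles = [r[0] for r in conn.execute("""
    SELECT t.label FROM person_type pt
    JOIN person_type_ref t USING (person_type_id)
    WHERE pt.person_id = 1 ORDER BY t.label""")]

# Roles that are still active.
active_roles = [r[0] for r in conn.execute("""
    SELECT t.label FROM person_type pt
    JOIN person_type_ref t USING (person_type_id)
    WHERE pt.person_id = 1 AND pt.date_ended IS NULL""")]
```

The same person appears once in person no matter how many roles they hold or have held.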
Since you have many different "types" of Persons, in order to have normalized design, with proper Foreign Key constraints, it's better to use the supertype/subtype pattern. One Person table (with the common to all attributes) and many subtype tables (Employee, Contractor, Customer, etc.), all in 1:1 relationship with the main Person table, and with necessary details for every type of Person.
Check this answer by @Branko for an example: Many-to-Many but sourced from multiple tables
250.000 records for a database is not very much. If you set your indexes appropriately you will never find any problems with that.
You should probably set a type for a user. Those types should be in a different table, so you can see what the type means (make it a TINYINT or similar). If you need additional fields per user type, you could indeed create a different table for that.
This approach sounds really good to me
Theoretically it would be possible to be a customer for the company you work for.
But if that's not the case here, then you could store people in different tables depending on their role.
However like Topener said, 250.000 isn't much. So I would personally feel safe to store every single person in one table.
And then have a column for each role (employee, customer, etc.)
Even if you end up with a one table solution (for core person attributes), you are going to want to abstract it with views and put on some constraints.
The last thing you want to do is send confidential information to clients which was only supposed to go to employees because someone didn't join correctly. Or an accidental cross join which results in income being doubled on a report (but only for particular clients which also had an employee linked somehow).
It really depends on how you want the layers to look and which components are going to access which layers and how.
Also, I would think you want to revisit your choice of MyISAM over InnoDB.
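The idea of hiding a single person table behind views could be sketched as follows, using Python's sqlite3 module in place of MySQL. The is_employee and is_customer flags, the salary column and the view names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    person_id   INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    surname     TEXT NOT NULL,
    phone       TEXT,
    is_employee INTEGER NOT NULL DEFAULT 0,
    is_customer INTEGER NOT NULL DEFAULT 0,
    salary      INTEGER           -- confidential, employees only
);
CREATE INDEX idx_person_name ON person (surname, name);

-- Client-facing code reads this view and never sees salary.
CREATE VIEW customers AS
    SELECT person_id, name, surname, phone FROM person WHERE is_customer = 1;
CREATE VIEW employees AS
    SELECT person_id, name, surname, phone, salary FROM person WHERE is_employee = 1;

INSERT INTO person (name, surname, is_employee, is_customer, salary) VALUES
    ('Ada', 'Smith', 1, 1, 5000),
    ('Bob', 'Jones', 0, 1, NULL);
""")

customer_columns = [d[0] for d in conn.execute("SELECT * FROM customers").description]
customer_names = [r[0] for r in conn.execute("SELECT name FROM customers ORDER BY name")]
employee_names = [r[0] for r in conn.execute("SELECT name FROM employees")]
```

Code that only ever queries customers cannot accidentally select employee-only columns, and the index on (surname, name) serves the name/surname search mentioned in the question.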

Foreign key column optionally contains NULL or ID. Is there a better design?

I'm working on a database that holds answers from a questionnaire for companies.
In the table that holds the bulk of the answers I have a column (i.e. techDir) that indicates whether there is a technical director. If the company has a director then it's populated with an ID referencing a "people" table, else it holds "null".
Another design that has come to mind is the "techDir" column holding a Boolean value, leaving the look-up in the "people" table to the software logic and adding a column in the "people" table indicating the role of the person.
Which of the two designs is better? Is there generally a better design that I have not thought of?
I would say that if there is a relatively small amount of NULL values, then using NULLs would be okay. However, if you find that most rows contain NULLs, then you might be better off deleting the techDir column and placing a column referencing the "Answers" into a new table alongside another field referencing the "People" table. In other words, create an intermediate table between the Answers table and the People table containing all technical directors as shown below.
This will get rid of all the NULL values and also allow for more flexibility. If there is only one Technical Director per answer then simply make the column referencing the answers table "Unique" to create a One-to-One relationship. If you need more than one technical director, create a One-to-Many relationship as shown. Another advantage to this design is that it simplifies the query if you ever want to extract all the technical directors. I generally use a simple rule of thumb when deciding whether to use NULL values or not. If I see the table contains lots of NULLS, I remove those columns and create a new table where I can store that data. You should of course also consider the types of queries you will be executing. For example, the design above might require an Inner or Outer Join to view all the rows including the technical directors. As a developer, you should carefully weigh up the pros and cons and look at things like flexibility, speed, complexity and your business rules when making these decisions.
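One way the intermediate table could look is sketched below, with Python's sqlite3 module standing in for the real database. The table name technical_directors and the sample rows are assumptions; the UNIQUE constraint implements the one-director-per-answer case described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE people  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE answers (id INTEGER PRIMARY KEY, company TEXT NOT NULL);
CREATE TABLE technical_directors (
    answer_id INTEGER NOT NULL UNIQUE REFERENCES answers(id),  -- at most one per answer
    person_id INTEGER NOT NULL REFERENCES people(id)
);
INSERT INTO people  VALUES (1, 'Dana');
INSERT INTO answers VALUES (100, 'Acme'), (101, 'Initech');
INSERT INTO technical_directors VALUES (100, 1);  -- Initech has none: no row
""")

# Outer join: every answer row, with the director's name where one exists.
rows = conn.execute("""
    SELECT a.company, p.name
    FROM answers a
    LEFT JOIN technical_directors td ON td.answer_id = a.id
    LEFT JOIN people p ON p.id = td.person_id
    ORDER BY a.id""").fetchall()

# Listing all technical directors needs no NULL filtering.
directors = [r[0] for r in conn.execute(
    "SELECT p.name FROM technical_directors td JOIN people p ON p.id = td.person_id")]
```

Dropping the UNIQUE constraint turns the same table into the one-to-many variant.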
Logically, if there is no director, there should be NULL.
In business logic, you would have a reference to a Director object there; if there is no director, there should also be null instead of the reference.
Using a boolean in fear of additional performance loss due to longer query time looks very much like premature optimisation.
Also there are joins that are optimized to do that efficiently in one query, so no additional lookups are necessary.
You could argue that it depends on how many people have a director, so you could save a little space when only 1 in a million entries has one, depending on the datatype you use. But in the end, the clearest (and best) option is indeed to make a foreign key that allows NULL, as you proposed in the first option.
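A nullable foreign key and the single-query lookup it allows can be sketched like this, again with Python's sqlite3 module standing in for the real database. The answers table name and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE answers (
    id      INTEGER PRIMARY KEY,
    company TEXT NOT NULL,
    techDir INTEGER REFERENCES people(id)   -- NULL means "no technical director"
);
INSERT INTO people  VALUES (1, 'Dana');
INSERT INTO answers VALUES (100, 'Acme', 1), (101, 'Initech', NULL);
""")

# One query returns the director when present and NULL otherwise.
rows = conn.execute("""
    SELECT a.company, p.name
    FROM answers a LEFT JOIN people p ON p.id = a.techDir
    ORDER BY a.id""").fetchall()

# The foreign key still rejects ids that do not exist in people.
try:
    conn.execute("INSERT INTO answers VALUES (102, 'Globex', 42)")
    bad_id_accepted = True
except sqlite3.IntegrityError:
    bad_id_accepted = False
```

A Boolean flag in place of techDir would give up that last guarantee, since nothing would tie the flag to an actual row in people.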
I think the null for that column is ok. As far as I remember from my DB class at uni (long time ago), null is an excellent choice to represent "I don't know" or "it doesn't have".
I think the second design has the following flaw: you didn't mention how to look up the techDir of a specific question; you said that you just tag the person. Another problem might be that if you add another role in the future, the schema won't support it.
NULL is the most common way of indicating no relationship in an optional relationship.
There is an alternative. Decompose the table into two tables, one of which has two foreign keys: one back to the original table and one forward to the related table. In cases where there is no relationship, just omit the entire row.
If you want to understand this in terms of normalization, look up "Sixth Normal form" (6NF). Not all experts are in agreement about 6NF.