I want to implement a vote system for several different entities/tables (e.g. articles, blog posts, users).
What is the best/more efficient approach?:
Create a table votes to store all the votes of all entities?
votes
vote_id
user_id
type (articles, blogposts or users)
Create a table votes for each entity? votes_articles, votes_blogposts, votes_users
What I see is:
First option will result with a bigger table and there's an additional field which I need to include in my queries. More generic table that can be easily extended for more entities if needed and everything is kind of centralised. (Can use a generic function to retrieve/insert/update the table.)
Second option will result with smaller tables; faster to query? But not necessarily better to maintain.
The second method has many advantages. Presumably, the votes are actually on entities, so you also have an id in each table pointing to the article, blogpost, or whatever that is being voted on. In a standard SQL database, you would like to have foreign key references to other tables, and the one-table-per-entity approach provides that capability.
You could modify the first approach to do this. However, that would require a separate column for each possible entity. And, then, you lose the easy flexibility of adding new entities.
When is the first approach advantageous? First, when maintaining valid foreign key references is not important. And, when you often want to bring together votes as votes. So, how many times did a user vote today regardless of what s/he voted on? How many votes do user A and user B have in common regardless of what they voted on? Get the idea. If votes starts to behave like its own entity, then it deserves its own table.
I happen to think that your very question highlights a major weakness in SQL and relational databases. This is an example of wanting different entities to "inherit" features from a class (to borrow terminology from the OO world). Wouldn't it be nice if you could just specify that a new entity inherits properties from another entity (such as "Votable")? Oh, never mind, that's not the real world of popular databases. At least not today.
EDIT:
If you care about performance, don't go with the modified first approach -- that is, a separate column for each possible entity. Normally, primary keys are 4-byte integers. These (in most databases at least) will occupy four bytes, regardless of whether the column has a NULL value. So, one table with three entity columns is (to a very rough approximation) three times the size of three tables specialized for each entity. Such wasted space only slows down the query processing.
If you are only going to have two or three entities, maybe this isn't that big a deal. But once you get to more than you can count on one hand, it really is a waste of space, memory, and processing power.
Related
I have three tables with common fields - users, guests and admins.
The last two tables have some of the users fields.
Here's an example:
users
id|username|password|email|city|country|phone|birthday|status
guests
id|city|country|phone|birthday
admins
id|username|password|status
I'm wondering if it's better to:
a)use one table with many NULL values
b)use three tables
The question is less about "one table with many NULL versus three tables" that about the data structure. The real question is how other tables in your data structure will refer to these entities.
This is a classic situation, where you have "one-of" relationships and need to represent them in SQL. There is a "right" way, and that is to have four tables:
"users" (I can't think of a good name) would encompass everyone and have a unique id that could be referenced by other tables
"normal", "admins", "guests" each of which would have a 1-0/1 relationship with "users"
This allows other tables to refer to any of the three types of users, or to users in general. This is important for maintaining proper relationships.
You have suggested two shortcuts. One is that there is no information about "normal" users so you dispense with that table. However, this means that you can't refer to "normal" users in another table.
Often, when the data structures are similar, the data is simply denormalized into a single row (as in your solution a).
All three approach are reasonable, in the context of applications that have specific needs. As for performance, the difference between having additional NULLABLE columns is generally minimal when the data types are variable length. If a lot of the additional columns are numeric, then these occupy real space even when NULL, which can be a factor in designing the best solution.
In short, I wouldn't choose between the different options based on the premature optimization of which might be better. I would choose between them based on the overall data structure needed for the database, and in particular, the relationships that these entities have with other entities.
EDIT:
Then there is the question of the id that you use for the specialized tables. There are two ways of doing this. One is to have a separate id, such as AdminId and GuestId for each of these tables. Another column in each table would be the UserId.
This makes sense when other entities have relationships with these particular entities. For instance, "admins" might have a sub-system that describes rights and roles and privileges that they have, perhaps along with a history of changes. These tables (ahem, entities) would want to refer to an AdminId. And, you should probably oblige by letting them.
If you don't have such tables, then you might still split out the Admins, because the 100 integer columns they need are a waste of space for the zillion other users. In that case, you can get by without a separate id.
I want to emphasize that you have asked a question that doesn't have a "best" answer in general. It does have a "correct" answer by the rules of normalization (that would be 4 tables with 4 separate ids). But the best answer in a given situation depends on the overall data model.
Why not have one parent user table with three foreign keyed detail tables. Allows unique user id that can transition.
I generally agree with Chriseyre2000, but in your specific example, I don't see a need for the other 2 tables. Everything is contained in users, so why not just add Guest and Admin bit fields? Or even a single UserType field.
Though Chriseyre2000's solution will give you better scalability should you later want to add fields that are specific to guests and admins.
How to store records that don't contain any fact? For example, let's say that a shop wants to count how many people have entered inside a store (and that they take info on every person that goes inside the shop). In warehouse, I guess there would be dimension table "Person" with different attributes, but how would fact table look like? Would it contain only foreign keys?
As you described it, that would be just a fact table. Actually, there is name for this -- factless fact table; fact table without any measures.
It is quite common for recoding events. Essentially anything that records: who, what, where, when and why? would be fact-measure-less table. If you add how much? then that goes into a measure.
You can think of it as the fact table containing an implicit count column, with the number of people entering, which is always "1" if you store data on individual person level, so that makes a fact table containing only FKs to the dimensions.
This of course only enabled analysis on the number of people entering, filtered for the various dimensions, but it seams like a realistic use case to me. I think you're on the right way.
In my database I have different entities like todos, events, discussions, etc. Each of them can have tags, comments, files, and other related items.
Now I have to design the relationships between these tables, and I think I have to choose from the following two possible solutions:
1. Separated relationship tables
So I will create todos_tags, events_tags, discussions_tags, todos_comments, events_comments, discussions_comments, etc. tables.
2. Common relationship tables
I will create only these tables: related_tags, related_comments, related_files, etc. having a structure like this:
related_tags
entity (event|discussion|todo|etc. - as enum or tinyint (1|2|3|etc.))
entity_id
tag_id
Which design should I use?
Probably you will say: it depends on the situation, and I think this is correct.
I my case most of the time (maybe 70%+) I will have to query only one of the entities (events, discussion or todos), but in some cases I need them all in the same query (both events, discussion, todos having a specified tag for example). In this case I'll have to do on union on 3+ tables (in my case it can be 5+ tables) if I go with separated relationship tables.
I'll not have more than 1000-2000 rows in each table(events, discussions, todos);
What is the correct way to go? What are some personal experiences about this?
The second schema is more extensible. This way you will be able to extend your application to construct queries involving more than one type. In addition, it's possible to easily add new types to the future even dynamically. Furthermore, it allows greater aggregation freedom, for example allowing you to count how many rows in each type exist, or how many were created during a particular timeframe.
On the other hand, the first design does not really have many advantages other than speed: But MySQL is already good at handling these types of queries fast enough for you. You can create an index "entity" to make it work smoothly. If in the future you need to partition your tables to increase speed, you can do so at a later stage.
It is a far simpler design to have a single, common relationship table such as related_tags where you specify the entity type in a column rather than having multiple tables. Just be sure you properly index the entity and tag_id fields together to have optimum performance.
Why would anyone distribute an entity (for example user) into multiple tables by doing something like:
user(user_id, username)
user_tel(user_id, tel_no)
user_addr(user_id, addr)
user_details(user_id, details)
Is there any speed-up bonus you get from this DB design? It's highly counter-intuitive, because it would seem that performing chained joins to retrieve data sounds immeasurably worse than using select projection..
Of course, if one performs other queries by making use only of the user_id and username, that's a speed-up, but is it worth it? So, where is the real advantage and what could be a compatible working scenario that's fit for such a DB design strategy?
LATER EDIT: in the details of this post, please assume a complete, unique entity, whose attributes do not vary in quantity (e.g. a car has only one color, not two, a user has only one username/social sec number/matriculation number/home address/email/etc.. that is, we're not dealing with a one to many relation, but with a 1-to-1, completely consistent description of an entity. In the example above, this is just the case where a single table has been "split" into as many tables as non-primary key columns it had.
By splitting the user in this way you have exactly 1 row in user per user, which links to 0-n rows each in user_tel, user_details, user_addr
This in turn means that these can be considered optional, and/or each user may have more than one telephone number linked to them. All in all it's a more adaptable solution than hardcoding it so that users always have up to 1 address, up to 1 telephone number.
The alternative method is to have i.e. user.telephone1 user.telephone2 etc., however this methodology goes against 3NF ( http://en.wikipedia.org/wiki/Third_normal_form ) - essentially you are introducing a lot of columns to store the same piece of information
edit
Based on the additional edit from OP, assuming that each user will have precisely 0 or 1 of each tel, address, details, and NEVER any more, then storing those pieces of information in separate tables is overkill. It would be more sensible to store within a single user table with columns user_id, username, tel_no, addr, details.
If memory serves this is perfectly fine within 3NF though. You stated this is not about normal form, however if each piece of data is considered directly related to that specific user then it is fine to have it within the table.
If you later expanded the table to have telephone1, telephone2 (for example) then that would violate 1NF. If you have duplicate fields (i.e. multiple users share an address, which is entirely plausible), then that violates 2NF which in turn violates 3NF
This point about violating 2NF may well be why someone has done this.
The author of this design perhaps thought that storing NULLs could be achieved more efficiently in the "sparse" structure like this, than it would "in-line" in the single table. The idea was probably to store rows such as (1 , "john", NULL, NULL, NULL) just as (1 , "john") in the user table and no rows at all in other tables. For this to work, NULLs must greatly outnumber non-NULLs (and must be "mixed" in just the right way), otherwise this design quickly becomes more expensive.
Also, this could be somewhat beneficial if you'll constantly SELECT single columns. By splitting columns into separate tables, you are making them "narrower" from the storage perspective and lower the I/O in this specific case (but not in general).
The problems of this design, in my opinion, far outweigh these benefits.
Our company has many different entities, but a good chunk of those database entities are people. So we have customers, and employees, and potential clients, and contractors, and providers and all of them have certain attributes in common, namely names and contact phone numbers.
I may have gone overboard with object-oriented thinking but now I am looking at making one "Person" table that contains all of the people, with flags/subtables "extending" that model and adding role-based attributes to junction tables as necessary. If we grow to say 250.000 people (on MySQL and ISAM) will this so greatly impact performance that future DBAs will curse me forever? Our single most common search is on name/surname combinations.
For, e.g. a company like Salesforce, are Clients/Leads/Employees all in a centralised table with sub-views (for want of a better term) or are they separated into different tables?
Caveat: this question is to do with "we found it better to do this in the real world" as opposed to theoretical design. I like the above solution, and am confident that with views, proper sizing and accurate indexing, that performance won't suffer. I also feel that the above doesn't count as a MUCK, just a pretty big table.
One 'person' table is the most flexible, efficient, and trouble-free approach.
It will be easy for you to do limited searches - find all people with this last name and who are customers, for example. But you may also find you have to look up someone when you don't know what they are - that will be easiest when you have one 'person' table.
However, you must consider the possibility that one person is multiple things to you - a customer because the bought something and a contractor because you hired them for a job. It would be better, therefore, to have a 'join' table that gives you a many to many relationship.
create person_type (
person_id int unsigned,
person_type_id int unsigned,
date_started datetime,
date_ended datetime,
[ ... ]
)
(You'll want to add indexes and foreign keys, of course. person_id is a FK to 'person' table; 'person_type_id' is a FK to your reference table for all possible person types. I've added two date fields so you can establish when someone was what to you.)
Since you have many different "types" of Persons, in order to have normalized design, with proper Foreign Key constraints, it's better to use the supertype/subtype pattern. One Person table (with the common to all attributes) and many subtype tables (Employee, Contractor, Customer, etc.), all in 1:1 relationship with the main Person table, and with necessary details for every type of Person.
Check this answer by #Branko for an example: Many-to-Many but sourced from multiple tables
250.000 records for a database is not very much. If you set your indexes appropriately you will never find any problems with that.
You should probably set a type for a user. Those types should be in a different table, so you can see what the type means (make it an TINYINT or similar). If you need additional fields per user type, you could indeed create a different table for that.
This approach sounds really good to me
Theoretically it would be possible to be a customer for the company you work for.
But if that's not the case here, then you could store people in different tables depending on their role.
However like Topener said, 250.000 isn't much. So I would personally feel safe to store every single person in one table.
And then have a column for each role (employee, customer, etc.)
Even if you end up with a one table solution (for core person attributes), you are going to want to abstract it with views and put on some constraints.
The last thing you want to do is send confidential information to clients which was only supposed to go to employees because someone didn't join correctly. Or an accidental cross join which results in income being doubled on a report (but only for particular clients which also had an employee linked somehow).
It really depends on how you want the layers to look and which components are going to access which layers and how.
Also, I would think you want to revisit your choice of MyISAM over InnoDB.