Normalization and Unique Ids

I have a table titled videos. In it there are three columns: media_id, project_id, and video_url. My question is, is it necessary for me to have media_id? I'm not using it in any other tables. I would expect there to be multiple rows with the same project_id but different video_urls.

Having or not having surrogate IDs for something has nothing to do with normalization.
(copyright catcall)
Having or not having surrogate IDs for something depends on whether you have a real use for them. You already answered that yourself. It also depends on whether there is a significant likelihood that, even if there is no actual use right now, one might emerge in the near future.

You could use project_id and video_url together as a composite key in your logical model, but at the physical level I would not want a URL as part of a key.
By this I mean I prefer an ID or number, to avoid repeating a long string each time the key is referenced from other tables.

I would consider it necessary. This is purely based on the fact that the media entry is unique and there could be multiple media entries for any one project. This keeps a unique id for the row, a proper project relationship and the valuable URL data for the media resource.
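
A minimal sketch of that layout, not a definitive schema. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the UNIQUE constraint and the example.com URLs are my own assumptions, not something from the question. The idea is that media_id identifies the row while the (project_id, video_url) pair is still kept unique:

```python
import sqlite3

# SQLite (Python standard library) stands in for MySQL in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE videos (
        media_id   INTEGER PRIMARY KEY,   -- surrogate id for the row
        project_id INTEGER NOT NULL,
        video_url  TEXT    NOT NULL,
        UNIQUE (project_id, video_url)    -- assumed: the natural key stays enforced
    )
""")

# Several videos may share one project_id.
conn.executemany(
    "INSERT INTO videos (project_id, video_url) VALUES (?, ?)",
    [(1, "http://example.com/a.mp4"), (1, "http://example.com/b.mp4")],
)


def try_insert(project_id, video_url):
    """Return True if the row was inserted, False if it violated a constraint."""
    try:
        conn.execute(
            "INSERT INTO videos (project_id, video_url) VALUES (?, ?)",
            (project_id, video_url),
        )
        return True
    except sqlite3.IntegrityError:
        return False


duplicate_accepted = try_insert(1, "http://example.com/a.mp4")
video_count = conn.execute(
    "SELECT COUNT(*) FROM videos WHERE project_id = 1"
).fetchone()[0]
```

If nothing ever references media_id, the composite key alone would do the same job; the surrogate only pays off once other tables need to point at a video.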

Related

Is it better to have userID or username in updatedBy field in database

I have fields like createdBy or updatedBy in almost all of my tables.
I think they are just for reference.
Do you think I should store the username there or the userID? If I need to look at the database directly, the username is easier to understand, but is that bad practice?

Always use foreign keys to store references to other records, which is userID in your case.
As for how to store it, that depends on what you need.
a) If you only want to know who last updated the record, create a userID column in the table. It is always good to store foreign keys instead of copies of other data, because this way you can relate and fetch all the records of a user. This approach has a limitation though: since you can store only one userID, you can only know who last updated the record.
b) If you want to keep the full history, to know which user updated the record and when, then you should store it in a one-to-many relationship table, for example user_log with columns user_id, update_datetime and perhaps a message column describing what the user did.
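
Both options can be sketched together. This is only an illustration: it uses Python's sqlite3 module in place of MySQL, and the articles table, the record_update helper and the sample users are hypothetical names I made up. Only user_log and its user_id, update_datetime and message columns come from the answer above.

```python
import sqlite3

# SQLite (Python standard library) stands in for MySQL in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
    CREATE TABLE users (
        user_id  INTEGER PRIMARY KEY,
        username TEXT NOT NULL UNIQUE
    );
    -- Option a): one updated_by column, so only the last editor is known.
    CREATE TABLE articles (
        article_id INTEGER PRIMARY KEY,
        title      TEXT NOT NULL,
        updated_by INTEGER REFERENCES users(user_id)
    );
    -- Option b): one row per update, so the full history is kept.
    CREATE TABLE user_log (
        user_id         INTEGER NOT NULL REFERENCES users(user_id),
        article_id      INTEGER NOT NULL REFERENCES articles(article_id),
        update_datetime TEXT NOT NULL,
        message         TEXT
    );
""")
conn.executemany(
    "INSERT INTO users (user_id, username) VALUES (?, ?)",
    [(1, "alice"), (2, "bob")],
)
conn.execute(
    "INSERT INTO articles (article_id, title, updated_by) VALUES (1, 'Draft', 1)"
)


def record_update(user_id, article_id, when, message):
    """Hypothetical helper: set the last editor and append a history row."""
    conn.execute(
        "UPDATE articles SET updated_by = ? WHERE article_id = ?",
        (user_id, article_id),
    )
    conn.execute(
        "INSERT INTO user_log (user_id, article_id, update_datetime, message) "
        "VALUES (?, ?, ?, ?)",
        (user_id, article_id, when, message),
    )


record_update(1, 1, "2024-01-01 09:00:00", "created")
record_update(2, 1, "2024-01-02 10:30:00", "fixed typo")

# Renaming a user touches one row; every reference still resolves.
conn.execute("UPDATE users SET username = 'alice_smith' WHERE user_id = 1")

last_editor = conn.execute(
    "SELECT u.username FROM articles a "
    "JOIN users u ON u.user_id = a.updated_by WHERE a.article_id = 1"
).fetchone()[0]
history = conn.execute(
    "SELECT u.username, l.message FROM user_log l "
    "JOIN users u ON u.user_id = l.user_id "
    "WHERE l.article_id = 1 ORDER BY l.update_datetime"
).fetchall()
```

Reading the raw table then takes a join to see names, which is the price of storing the ID instead of the username.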

UserID.
Because IDs are smaller and faster to compare than usernames.
Also, if a user wants to change their username, you will not need to update every table that references it.

Always use IDs to keep a normalized relational data structure. This will provide better performance and much more scalability. If you can add constraints, it will be that much cleaner.

It is not always bad. It depends on your application's needs. Normalisation is good for removing redundancy, whereas if speed is the deciding factor you can keep the username as it is, since joins take time. Also, inserting data means inserting into two tables.
Nevertheless, always +1 for normalisation, by the book :)

Use something that cannot be changed later. Usually that is true of the user_id.
In special cases you may want to store the name in addition (to be able to display the name of the user at the time, before she married, or the name of a user that has since been deleted). But normally, you query the database again for the (current) name (which can also be cached easily).

In the case of usernames, surrogate keys tend to be a better choice. So, in your case, the FKs (createdBy and updatedBy) will reference the surrogate key (userID) and not the natural key (username).
However, this doesn't mean surrogate is always better than natural key: consider this list of criteria.

Database architecture

For example, I have posts, answers to posts, and comments on answers.
Which is better: to have a separate table for each entity (posts, answers, comments), or one table with 'post_type' and 'parent_id' columns?
UPD: answers and comments have similar properties.

The best route would be 3 tables:
Posts
Answers
    Posts_ID (parent)
Comments
    Answers_ID (parent)
Posts being the ultimate parent, answers linked to it, and comments linked to answers.
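
A rough sketch of the three-table layout, assuming SQLite through Python's sqlite3 module instead of MySQL. The column names (post_id, answer_id, body) and the comments_on_post helper are placeholders of mine:

```python
import sqlite3

# SQLite (Python standard library) stands in for MySQL in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
    CREATE TABLE posts (
        post_id INTEGER PRIMARY KEY,
        body    TEXT NOT NULL
    );
    CREATE TABLE answers (
        answer_id INTEGER PRIMARY KEY,
        post_id   INTEGER NOT NULL REFERENCES posts(post_id),     -- parent
        body      TEXT NOT NULL
    );
    CREATE TABLE comments (
        comment_id INTEGER PRIMARY KEY,
        answer_id  INTEGER NOT NULL REFERENCES answers(answer_id), -- parent
        body       TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO posts (post_id, body) VALUES (1, 'How do I do X?')")
conn.execute("INSERT INTO answers (answer_id, post_id, body) VALUES (10, 1, 'Try Y.')")
conn.executemany(
    "INSERT INTO comments (comment_id, answer_id, body) VALUES (?, ?, ?)",
    [(100, 10, "Worked for me."), (101, 10, "Thanks!")],
)


def comments_on_post(post_id):
    """All comment bodies under a post, reached through its answers."""
    rows = conn.execute(
        "SELECT c.body FROM comments c "
        "JOIN answers a ON a.answer_id = c.answer_id "
        "WHERE a.post_id = ? ORDER BY c.comment_id",
        (post_id,),
    ).fetchall()
    return [body for (body,) in rows]


# A comment pointing at a missing answer is rejected by the foreign key.
try:
    conn.execute(
        "INSERT INTO comments (comment_id, answer_id, body) VALUES (102, 999, 'orphan')"
    )
    orphan_accepted = True
except sqlite3.IntegrityError:
    orphan_accepted = False
```

With separate tables the parent links are ordinary foreign keys, so the database itself keeps a comment from attaching to a post directly or to nothing at all.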

Whether you have separate tables for posts and comments depends partly on whether one can be used in place of the other and partly on how similar the processing is for each. Put another way, how much information is unique to each? If there is nothing or almost nothing unique to each, then one table would very probably work.

Under certain circumstances it makes sense to use a single table with a differentiator column (option 2) BUT any columns that belong solely to one type should go into its own table, and that table will have a foreign key mapping back to the main table. This will form a table inheritance hierarchy.
This makes sense if the objects these tables represent share a significant number of columns, and their commonalities allow some processing to be generalised across the different types without having to know the specific subtype.
If the majority of your queries end up having to filter by the differentiator column, what real advantage do you gain from storing them in the same place?
In short: K.I.S.S. applies; if there is a programming advantage to be had by using the single-table approach, use that; otherwise keep it simple and use 3 tables (counterintuitive, I know, but the added cognitive load of doing self-joins all the time should convince you).
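
For comparison, here is a sketch of the single-table variant with a differentiator column and one subtype table. It assumes SQLite via Python's sqlite3 module, and the names entries, entry_type and post_details, as well as the idea that only posts carry a title, are hypothetical and chosen just for illustration:

```python
import sqlite3

# SQLite (Python standard library) stands in for MySQL in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
    -- Shared columns live in one table, with a differentiator column.
    CREATE TABLE entries (
        entry_id   INTEGER PRIMARY KEY,
        entry_type TEXT NOT NULL
                   CHECK (entry_type IN ('post', 'answer', 'comment')),
        parent_id  INTEGER REFERENCES entries(entry_id),
        body       TEXT NOT NULL
    );
    -- Columns that belong to one type only go into their own table.
    CREATE TABLE post_details (
        entry_id INTEGER PRIMARY KEY REFERENCES entries(entry_id),
        title    TEXT NOT NULL
    );
""")
conn.executemany(
    "INSERT INTO entries (entry_id, entry_type, parent_id, body) VALUES (?, ?, ?, ?)",
    [
        (1, "post", None, "How do I do X?"),
        (2, "answer", 1, "Try Y."),
        (3, "comment", 2, "Thanks!"),
    ],
)
conn.execute("INSERT INTO post_details (entry_id, title) VALUES (1, 'Doing X')")

# Type-agnostic processing works on the shared table alone.
total_entries = conn.execute("SELECT COUNT(*) FROM entries").fetchone()[0]

# Walking the hierarchy needs a self-join plus filters on the differentiator.
thread = conn.execute(
    "SELECT a.body, c.body FROM entries a "
    "LEFT JOIN entries c "
    "  ON c.parent_id = a.entry_id AND c.entry_type = 'comment' "
    "WHERE a.parent_id = 1 AND a.entry_type = 'answer'"
).fetchall()
```

The last query is the kind of self-join the answer warns about: if most queries end up looking like that, the three separate tables are probably easier to live with.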

MySQL - Should I use ID columns?

I have a doubt about best practices and how the database engine works.
Suppose I create a table called Employee, with the following columns:
SS ID (Primary Key)
Name
Sex
Age
The thing is, I see a lot of databases where every table has an additional column called ID, which is a sequential number. Should I put an ID field in my table here? I mean, it already has a primary key to be indexed. Will the database work faster with a sequential ID field? I don't see how it helps if I won't use it to link or search any table.
Does it help? If so, why? What happens in the database?
thanks!
EDIT -----
This is just a silly example. Forget about the SS_ID, I know there are better ways of choosing a primary key. The main topic is that some people I know just ask me to add a column named ID, even when I know we won't use it in any SQL query. They just think it helps the database's performance in some way, especially because some database tools like Microsoft Access always ask whether we want to add this new column.
This is wrong, right?

If SS means "Social Security", I'd strongly advise against using that as a PK. An auto-incremented identity is the way to go.
Using keys with business logic built in is a bad idea. Lots of people are sensitive about giving out Social Security information. Your app could lose part of its audience if it uses SS as the primary key. Laws like HIPAA can make it impossible for you to use.

The actual performance gain in having a sequential id is going to depend a lot on how you use the table.
If you're using some ORM framework, these generally work better with a sequential ID of an integral type [1], which is typically achieved with a sequential id column.
If you don't use an ORM framework, having an id key that you never use alongside a natural ss_id key that is effectively what you always use makes little sense.
If you're referencing employees from other database table (foreign-key), then it'll probably be more efficient to have an id column, as storing that integer is going to consume less space in the child tables than storing the ss_id (which I assume is a CHAR or VARCHAR) everywhere.
On the ss_id, assuming it's a social security number (looks like it would be), there might be legal & privacy concerns attached to it that you should care about - my answer assumes you do have valid reasons to have social security numbers in your database, and that you would be legally allowed to use & store them.
[1] This is usually explained by the fact that ORM frameworks rely on highly specialized cache mechanisms that are tailored to typical ORM use, which usually implies having a sequential id primary key and letting the application deal with the actual business identity. This is in fact related to considerations very similar to the "foreign key" considerations above.

US Social Security numbers are not sufficiently identifying. And banks certainly do not use them in that way. Not everybody has one. Errors result in duplicates. Foreigners don't have them. They are far too fragile to use as a database PK.
Most importantly: they are said to be reused after death, although the Social Security Administration states that it does not reassign numbers.
Do some research: SSN as Primary Key

What's more important (obviously) is that you have a primary key, as long as the data you use for that primary key is uniquely identifying. In your example, SSNs are uniquely identifying, which is why banks use them, and they will work. The problem with this example is that your Employee ID is likely to be used as a foreign key in other tables, which means you're taking personal information (that is legally protected) and spraying it across your data model. You might do better using an auto-incremented field in this case.
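
To make the foreign-key point concrete, here is a small sketch. It assumes SQLite via Python's sqlite3 module rather than MySQL (so AUTOINCREMENT instead of AUTO_INCREMENT), and the timesheet table, its columns and the placeholder number are invented for the example:

```python
import sqlite3

# SQLite (Python standard library) stands in for MySQL in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
    CREATE TABLE employee (
        id    INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        ss_id TEXT NOT NULL UNIQUE,               -- sensitive value, stored once
        name  TEXT NOT NULL
    );
    -- Hypothetical child table: it carries the integer id, never the ss_id.
    CREATE TABLE timesheet (
        timesheet_id INTEGER PRIMARY KEY,
        employee_id  INTEGER NOT NULL REFERENCES employee(id),
        hours        REAL NOT NULL
    );
""")

cur = conn.execute(
    "INSERT INTO employee (ss_id, name) VALUES (?, ?)",
    ("000-00-0000", "Pat Example"),  # placeholder, not a real number
)
employee_id = cur.lastrowid
conn.executemany(
    "INSERT INTO timesheet (employee_id, hours) VALUES (?, ?)",
    [(employee_id, 7.5), (employee_id, 8.0)],
)

# The sensitive column exists in exactly one table.
timesheet_columns = [row[1] for row in conn.execute("PRAGMA table_info(timesheet)")]

# Looking someone up by ss_id is still possible through a join.
total_hours = conn.execute(
    "SELECT SUM(t.hours) FROM timesheet t "
    "JOIN employee e ON e.id = t.employee_id WHERE e.ss_id = ?",
    ("000-00-0000",),
).fetchone()[0]
```

If no other table ever references employee and no ORM is involved, the extra id column buys nothing, which matches the other answers here.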

Foreign key column optionally contains NULL or ID. Is there a better design?

I'm working on a database that holds answers from a questionnaire for companies.
In the table that holds the bulk of the answers I have a column (e.g. techDir) that indicates whether there is a technical director. If the company has a director then it's populated with an ID referencing a "people" table, else it holds NULL.
Another design that has come to mind is the "techDir" column holding a Boolean value, leaving the look-up in the "people" table to the software logic and adding a column in the "people" table indicating the role of the person.
Which of the two designs is better? Is there generally a better design that I have not thought of?

I would say that if there is a relatively small number of NULL values, then using NULLs would be okay. However, if you find that most rows contain NULLs, then you might be better off deleting the techDir column and placing a column referencing the "Answers" table into a new table alongside another column referencing the "People" table. In other words, create an intermediate table between the Answers table and the People table containing all technical directors.
This will get rid of all the NULL values and also allow for more flexibility. If there is only one technical director per answer then simply make the column referencing the Answers table unique to create a one-to-one relationship. If you need more than one technical director, leave it non-unique to create a one-to-many relationship. Another advantage of this design is that it simplifies the query if you ever want to extract all the technical directors.
I generally use a simple rule of thumb when deciding whether to use NULL values or not. If I see that the table contains lots of NULLs, I remove those columns and create a new table where I can store that data. You should of course also consider the types of queries you will be executing. For example, the design above might require an inner or outer join to view all the rows including the technical directors. As a developer, you should carefully weigh up the pros and cons and look at things like flexibility, speed, complexity and your business rules when making these decisions.
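
The intermediate table could look roughly like this. It is a sketch only, using SQLite via Python's sqlite3 module instead of whatever database the questionnaire runs on; the table name tech_directors and all column names and sample rows are my assumptions:

```python
import sqlite3

# SQLite (Python standard library) stands in for the real database here.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
    CREATE TABLE people  (person_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE answers (answer_id INTEGER PRIMARY KEY, company TEXT NOT NULL);
    -- One row per company that actually has a technical director.
    CREATE TABLE tech_directors (
        answer_id INTEGER NOT NULL UNIQUE REFERENCES answers(answer_id),
        person_id INTEGER NOT NULL REFERENCES people(person_id)
    );
""")
conn.executemany(
    "INSERT INTO people (person_id, name) VALUES (?, ?)",
    [(1, "Dana"), (2, "Eli")],
)
conn.executemany(
    "INSERT INTO answers (answer_id, company) VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)
conn.execute("INSERT INTO tech_directors (answer_id, person_id) VALUES (1, 1)")

# An outer join brings back every company, with None where no director exists.
report = conn.execute(
    "SELECT a.company, p.name FROM answers a "
    "LEFT JOIN tech_directors t ON t.answer_id = a.answer_id "
    "LEFT JOIN people p ON p.person_id = t.person_id "
    "ORDER BY a.answer_id"
).fetchall()

# UNIQUE on answer_id keeps it one-to-one: a second director is rejected.
try:
    conn.execute("INSERT INTO tech_directors (answer_id, person_id) VALUES (1, 2)")
    second_director_accepted = True
except sqlite3.IntegrityError:
    second_director_accepted = False
```

Dropping the UNIQUE constraint turns the same table into the one-to-many version. Note that the report still yields a missing value for companies without a director; the NULL just moves from storage into the query result.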

Logically, if there is no director, there should be NULL.
In business logic, you would have a reference to a Director object there; if there is no director, there should also be null instead of the reference.
Using a boolean in fear of additional performance loss due to longer query time looks very much like premature optimisation.
Also there are joins that are optimized to do that efficiently in one query, so no additional lookups are necessary.

You could argue that it depends on how many companies have a director, so you could save a little space when only 1 in a million entries has one, depending on the datatype you use. But in the end, the clearest (and best) option is indeed to make a foreign key that allows NULL, like you proposed in the first option.

I think the null for that column is ok. As far as I remember from my DB class at uni (long time ago), null is an excellent choice to represent "I don't know" or "it doesn't have".
I think the second design has the following flaw: you didn't mention how to look up the techDir of a specific questionnaire entry, you said that you just tag the person. Another problem might be that if in the future you add another role, the schema won't support it.

NULL is the most common way of indicating no relationship in an optional relationship.
There is an alternative. Decompose the table into two tables, one of which has two foreign keys, one back to the original table and one forward to the related table. In cases where there is no relationship, just omit the entire row.
If you want to understand this in terms of normalization, look up "Sixth Normal form" (6NF). Not all experts are in agreement about 6NF.

Database Structure for a website commenting system

I'm working on a website currently that needs a commenting system. As this website is brand new, and the database structure has yet to be set in stone, I would like some suggestions on how to best handle a commenting system such as this:
Comments must be able to be placed on anything. Including items in future tables.
Comments must be quickly (and easily?) queryable.
I know that this alone is not much to go on, so here is the idea: Each university has Colleges, each College has Buildings, and each Building has Rooms. Every user should be able to comment on any of these four items (and future ones we may add later), but I'd like to avoid making a comments table for each item.
The solution I have come up with this far seems to work, but I'm open to other ideas as well. My solution is to use UUIDs as the primary key for each item (university, college, building, room) table, then have the reference id in the comments table be that UUID. While I don't think I can make a system of foreign keys to link everything, I believe that nothing will break as only items available can possibly have comments, therefore an item can either have no comments, or if it is deleted, then the comments simply will never be returned.
University:
UniversityID - CHAR(36) //UUID() & primary key
...
Comments:
CommentID - CHAR(36) //UUID() & primary key
CommentItemID - CHAR(36) //UUID of item & indexed
CommentUserID - INTEGER
CommentBody - TEXT
And then queries will appear like:
SELECT * FROM University, Comments WHERE UniversityID = CommentItemID;
So what do you all think? Will this system scale well with large amounts of data, or is there a better (maybe Best Practice or Pattern) way?
I thank you in advance.
Edit 1: I have altered the Comment definition to include a primary key and indexed column to address the issues raised thus far. This way the system can also have comments of comments (not sure how confusing this would be in practical code, but it has a certain mathematical completeness to it that I like). I wanted to keep the system as similar as possible though until I have accepted an answer.
Both answers so far, by Sebastian Good and Bryan M., have suggested a dual key of two columns, something like ItemID and TableID. My only hesitation with this method is that I would either have to have a new table listing the TableIDs and their corresponding string table names, or introduce global variables into my code referencing them. Unless there is another method I am missing, this seems to me like extra code that can be avoided.
What do you all think?

I would just take a more traditional approach to the foreign key relationship between the comments and whatever they're bound to.
UNIVERSITY
UniversityID // assuming primary key
COMMENTS
CommentID // assuming primary key
TypeID // Foreign Key
Type // Name of the table where the foreign key is found (ie, University)
This just feels a bit cleaner to me. Something about using a foreign key of another table as the primary key for your comments didn't feel right.

If you use a UUID, it's hard to know what table it came from. If you only ever want to look from the entity down to the comments, as in your query, it'll work alright. If you want to look at a comment and find out what it was on, you'll have to look at all possible tables (universities, buildings, etc.) to find out.
One possibility which enables you to use simple sequential integers for keys of your base entities (which is often desirable for readability, index fragmentation, etc.) is to make the key of your comments table contain two columns. One is the name of the table the comment applies to. The second is the key of that table. This is similar to the approach Bryan M. suggests, though note that you won't be able to actually define foreign keys from the comments table to all possible parents. Your queries will work both ways round if necessary, and you don't need to worry about UUIDs, as the combination of table name + ID will be unique across the database.
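
A minimal sketch of the table name + ID idea, assuming SQLite via Python's sqlite3 module in place of MySQL. The column names item_table and item_id, the index and both helper functions are hypothetical; the pair is used here as an indexed reference rather than as the comments table's own primary key, so that one item can have many comments:

```python
import sqlite3

# SQLite (Python standard library) stands in for MySQL in this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE university (university_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE building   (building_id   INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE comments (
        comment_id INTEGER PRIMARY KEY,
        item_table TEXT    NOT NULL,  -- name of the table the comment is on
        item_id    INTEGER NOT NULL,  -- key of the row in that table
        user_id    INTEGER NOT NULL,
        body       TEXT    NOT NULL
        -- no FOREIGN KEY is possible here: the parent table varies per row
    );
    CREATE INDEX comments_item ON comments (item_table, item_id);
""")
conn.execute("INSERT INTO university (university_id, name) VALUES (1, 'State U')")
conn.execute("INSERT INTO building (building_id, name) VALUES (1, 'Library')")
conn.executemany(
    "INSERT INTO comments (item_table, item_id, user_id, body) VALUES (?, ?, ?, ?)",
    [
        ("university", 1, 7, "Great campus."),
        ("building", 1, 7, "Quiet study rooms."),
        ("university", 1, 8, "Parking is hard."),
    ],
)


def comments_for(item_table, item_id):
    """From the entity down to its comments."""
    rows = conn.execute(
        "SELECT body FROM comments WHERE item_table = ? AND item_id = ? "
        "ORDER BY comment_id",
        (item_table, item_id),
    ).fetchall()
    return [body for (body,) in rows]


def target_of(comment_id):
    """From a comment back to what it was on, without scanning every table."""
    return conn.execute(
        "SELECT item_table, item_id FROM comments WHERE comment_id = ?",
        (comment_id,),
    ).fetchone()
```

University 1 and building 1 share the same integer key, yet their comments stay separate because the table name is part of the reference. The cost, as noted above, is that the database cannot enforce the link, so deleting an item leaves its comments for the application to clean up.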

Well, since no one appears to want to answer, I guess I'll just stick with my method. However, I'll still be open to taking other suggestions.
