Nightmare on deciding database schema - mysql

I am having a nightmare deciding on a database schema! I recently signed on for my first freelance project.
It has user registration, and there are fairly extensive requirements for the user table, as follows:
name
password
email
phone
is_active
email_verified
phone_verified
is_admin
is_worker
is_verified
has_payment
last_login
created_at
Now I am very confused about whether to put everything in a single table or split things up, as I still need to add a few more fields, like:
token
otp ( may be in future )
otp_limit ( may be in future ) // rate limiting
There may be more in the future when there is an update. I am afraid that, if a future update needs a new field, I won't know how to add it if everything is in a single table.
And if I split things up, will that cause performance issues? Most of the fields are moderately used in the web app.
Please help me decide. This is my first freelancing experience (and it's pretty tough and rough) :(

If two tables have the same PRIMARY KEY, they should (with few exceptions) be combined in the same table. So, one table.
As for adding columns for future expansion, don't. Do ALTER TABLE .. ADD COLUMN .. when new columns are needed.
Once you have more than a million rows, adding a column becomes invasive, so try to get most new columns added before then.
You mentioned payment. If there is only one payment, simply have a column(s) with the amount and/or date. Make them NULLable to indicate that it has not been paid yet. If there will be multiple payments, then have another table dedicated to "payments", with zero or more rows for the payments.
That NULL technique won't work for a "verified" flag; it does need a separate column.
is_worker, is_admin -- Consider a single column that is an ENUM or SET to provide boolean "attributes" for the user. Use SET if, for example, a user can be both a worker and an admin.
Each "entity" (users, payments, etc.) should be a database table. Relations between tables are 1:1 (which I argued against, above), 1:many (e.g., user_id in the Payments table), or many:many (with an extra table with 2 ids).
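As a rough sketch of the advice above (one users table, a SET for roles, a separate payments table, and ALTER TABLE when a new field is actually needed), something like the following could work. The column types, sizes, index names, and the `roles` and `token` definitions are assumptions for illustration, not requirements from the question.

```sql
-- Hypothetical single users table; types and sizes are assumptions.
CREATE TABLE users (
    id             INT UNSIGNED NOT NULL AUTO_INCREMENT,
    name           VARCHAR(100) NOT NULL,
    password       VARCHAR(255) NOT NULL,   -- assumed to hold a password hash
    email          VARCHAR(255) NOT NULL,
    phone          VARCHAR(32)  NULL,
    is_active      BOOLEAN      NOT NULL DEFAULT TRUE,
    email_verified BOOLEAN      NOT NULL DEFAULT FALSE,
    phone_verified BOOLEAN      NOT NULL DEFAULT FALSE,
    roles          SET('admin', 'worker') NOT NULL DEFAULT '',  -- replaces is_admin / is_worker
    last_login     DATETIME     NULL,
    created_at     DATETIME     NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id),
    UNIQUE KEY uq_users_email (email)
) ENGINE=InnoDB;

-- 1:many relation: zero or more payment rows per user.
CREATE TABLE payments (
    id      INT UNSIGNED  NOT NULL AUTO_INCREMENT,
    user_id INT UNSIGNED  NOT NULL,
    amount  DECIMAL(10,2) NOT NULL,
    paid_at DATETIME      NOT NULL,
    PRIMARY KEY (id),
    KEY idx_payments_user (user_id),
    CONSTRAINT fk_payments_user FOREIGN KEY (user_id) REFERENCES users (id)
) ENGINE=InnoDB;

-- Later, when a new field is really needed, add it then.
ALTER TABLE users ADD COLUMN token CHAR(64) NULL;

-- Users who are both admin and worker.
SELECT id, name
FROM users
WHERE FIND_IN_SET('admin', roles) > 0
  AND FIND_IN_SET('worker', roles) > 0;
```

With multiple payments in their own table, a has_payment flag can be derived from whether any payments row exists for the user, so it may not need its own column.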

Related

what is the best practice - a new column or a new table?

I have a users table, that contains many attributes like email, username, password, phone, etc.
I would like to save a new type of data (an integer), let's call it "superpower", but only very few users will have it. The users table contains 10K+ records, while fewer than 10 users will have a superpower (for all others it will be null).
So my question is which of the following options is more correct and better in terms of performance:
add another column in the users table called "superpower", which will be null for almost all users
have a new table called users_superpower, which will contain at most 10 records and will map users to superpowers.
Some things I have thought about:
a. the first option seems wasteful of space, but it's really just an integer...
b. the second option will require a left join every time I query the users...
c. will the answer change if the "superpower" data was 5 columns, for example?
Note: I'm using Hibernate and MySQL, if it changes the answer.
This might be a matter of opinion. My viewpoint on this follows:
If superpower is an attribute of users and you are not in the habit of adding attributes, then you should add it as a column. 10,000*4 additional bytes is not very much overhead.
If superpower is just one attribute and you might add others, then I would suggest using JSON or another EAV table to store the value.
If superpower is really a new type of user with other attributes and dates and so on, then create another table. In this table, the primary key can be the user_id, making the joins between the tables even more efficient.
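To make the two main options concrete, here is a minimal sketch. The stand-in `users` table (with an `id` primary key and a `username` column) and the constraint name are assumptions; the real table will have more columns.

```sql
-- Minimal stand-in for the existing table (assumed shape).
CREATE TABLE users (
    id       INT NOT NULL AUTO_INCREMENT,
    username VARCHAR(50) NOT NULL,
    PRIMARY KEY (id)
);

-- Option 1: a nullable column, NULL for almost every user.
ALTER TABLE users ADD COLUMN superpower INT NULL;

-- Option 2: a side table whose primary key is the user_id,
-- so each user has at most one row and the join is a primary-key lookup.
CREATE TABLE users_superpower (
    user_id    INT NOT NULL,
    superpower INT NOT NULL,
    PRIMARY KEY (user_id),
    CONSTRAINT fk_superpower_user FOREIGN KEY (user_id) REFERENCES users (id)
);

-- With option 2, reading the value needs a LEFT JOIN;
-- users without a row come back with superpower = NULL.
SELECT u.id, u.username, s.superpower
FROM users AS u
LEFT JOIN users_superpower AS s ON s.user_id = u.id;
```

Both options are shown in one script only for comparison; in practice you would pick one.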
I would go with just adding a new boolean field in your user entity which keeps track of whether or not that user has superpowers.
Appreciate that adding a new table and linking it requires a foreign key column linking the two tables, and this key will be another column taking up space. So it doesn't really get around the storage cost. If you just want a really small column to store whether a user has superpowers, you can use a boolean variable, which would map to a MySQL BIT(1) column. Because this is a fixed width column, NULL values would still take up a single bit of space, but this is most likely not a big storage concern compared to the rest of your table.

MySQL: better to have two tables or two columns

I have a database with contacts in it. There are two different types of contacts, Vendors and Clients.
The Vendor table has a vendor_contacts table attached via foreign key value to allow for a one to many relationship. The client has a similar table.
These contacts can have a one-to-many relationship with a phone numbers table. Should I have a separate phone numbers table for each of these, or one shared phone number table with two foreign keys, allowing one to be null?
OPTION 1
Here I would have to enforce that one of vendor_id or client_id was NULL and the other not NULL in the shared phone table.
OPTION 2
Here each table would have its own phone number table.
TBH I would merge the vendor and client tables and have a 'contact' table. This could have a contact type and would allow for newer contacts to be added.
Consider you want to add something to your contacts - address, you may have to change each table in the same way, then you want birthday (OK maybe not but just as an example) and again, changes to multiple tables. Whereas if you have a single table, it can reduce the overhead of managing this.
This will also mean you have one contact phone number table!
"wasting space" is not really a meaningful concern in modern database systems - and "null" values are usually optimized by the storage engine to take no space anyway.
Instead, I think you need to look at likely query scenarios, at maintainability, and at intelligibility of your schema.
So, in general, a schema that repeats itself - many tables with similar columns - suggests poor maintainability, and often leads to complicated queries.
In your example, imagine a query to find out who called from a given number, and whom they might have been trying to reach.
In option 1, you query the phone number, and outer join it to the two contact tables - relatively easy. In option 2, you have a union of two similar queries (only the table names would change) - duplication and lots of chance for bugs.
Imagine you want to break the phone number into country, region and phone number - in option 2, you have to do this twice (and modify all the queries twice); in option 1, you have to do this only once.
In general terms, repetition is a sign of a bad software design; this also counts for database schemas.
That's also a reason (as @siggisv and @NigelRen suggested) to flatten the vendor_contact and client_contact tables into a single table with a "contact_type" column.
I would use two different tables, a vendor_contacts table and a client_contacts table.
If you only have one table, you always waste space, as each row will have a null column.
Option 2,
but change vendor_contact and client_contact to 'contact'
and add a 'type' column to 'contact' that identifies 'client' or 'vendor' if you need to separate the records.
I would do as others have suggested and merge vendor_contact and client_contact into one contact table.
But on top of that, I doubt that contact<->phone is a one-to-many relationship. If you consider this example you will see that it's a many-to-many relationship:
"Joe and Mary are both vendors, working in the same office. Therefore they both have the same landline number. They also have each their own mobile number."
So in my opinion you would need to add a contact_number table with two columns of foreign keys, contact_id and phone_id.
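A minimal sketch of the merged design suggested in several answers: one contact table with a type column, one phone table, and a contact_number link table for the many-to-many case. Table and column names other than contact_id, phone_id, contact_type, and contact_number are assumptions, and the link from a contact to its vendor or client company is left out for brevity.

```sql
CREATE TABLE contact (
    contact_id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    contact_type ENUM('vendor', 'client') NOT NULL,
    name         VARCHAR(100) NOT NULL,
    PRIMARY KEY (contact_id)
);

CREATE TABLE phone (
    phone_id     INT UNSIGNED NOT NULL AUTO_INCREMENT,
    phone_number VARCHAR(32)  NOT NULL,
    PRIMARY KEY (phone_id),
    UNIQUE KEY uq_phone_number (phone_number)
);

-- Link table: one row per (contact, phone) pair, so a shared
-- landline is stored once and linked to several contacts.
CREATE TABLE contact_number (
    contact_id INT UNSIGNED NOT NULL,
    phone_id   INT UNSIGNED NOT NULL,
    PRIMARY KEY (contact_id, phone_id),
    KEY idx_contact_number_phone (phone_id),
    FOREIGN KEY (contact_id) REFERENCES contact (contact_id),
    FOREIGN KEY (phone_id)   REFERENCES phone (phone_id)
);

-- "Who might be calling from this number?" is a single query,
-- with no UNION over vendor and client tables.
SELECT c.name, c.contact_type
FROM phone AS p
JOIN contact_number AS cn ON cn.phone_id = p.phone_id
JOIN contact AS c ON c.contact_id = cn.contact_id
WHERE p.phone_number = '555-0100';
```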

Database design for time dependent fields

I am making a MySQL database and am fairly confident I know how to normalize it. However, there is an issue I am not sure how to deal with.
Say I have a table
users
----------
user_id primary key
some_field
some_field2
start_date
user_level
Now, user_level gives the user's level, which can be 1,2,3,4,5 say. But as time passes the user may change levels. Obviously if they change levels I can simply do an UPDATE to the users table. But I want to keep a historical record of the users' past levels
For this reason, I am considering a new table called user_level_history
user_level_history
--------------
id autoincrement primary key
user_id
level_start_date
and then modify the users table:
users
----------
user_id primary key
some_field
some_field2
start_date
user_level_history_id
Then to get the user's current level I check the
user_level_history_id = user_level_history.id
And to get the user's history I can SELECT from user_level_history all rows with the user_id and order chronologically.
Is this the standard way to do this? I can't imagine I'm the first person to come across this problem.
One more point: I am imagining less than 5000 users. Would having many, many more users require a different solution?
Thanks in advance.
I think that could be designed like this:
Have a table for level information like value(1,2,3,4,5) , description ...
Have an association table for user_level_history containing user_id, level_id,level_start_date ...
Have a foreign key from level table to user table with the role user-active-level.
You need a mechanism that inserts a row into the history table whenever a user's level changes.
No, you aren't the first. Querying temporal data is a common requirement, especially in data warehouse/data mining.
The relational data model doesn't have any native, built in support for storing or querying "temporal data".
A lot of work has been done; I have a book by C.J.Date et al. that covers the topic decently: "Temporal Data and the Relational Model". I've also come across several white papers.
One typical, reasonably simplistic approach to storing a "history" is to have a "current" table (like the one you already have), and then add a "history" table. Whenever a row is changed (inserted, updated, deleted) in the "current" table, you add a row to the "history" table, along with the date the row was changed. (You can store a copy of the pre-change row, or a copy of the post-change row, or both.)
With this approach, there's no need to add any columns to the "current" table.
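Here is a minimal sketch of the "current" plus "history" approach, storing the post-change value. The stand-in `users` table, the `user_level` column in the history table, and the example user id 42 are assumptions. The change is written from the application inside a transaction; a trigger on `users` would be another way to do it.

```sql
-- "Current" table: keeps user_level as in the original design (assumed shape).
CREATE TABLE users (
    user_id    INT UNSIGNED     NOT NULL AUTO_INCREMENT,
    start_date DATE             NOT NULL,
    user_level TINYINT UNSIGNED NOT NULL,
    PRIMARY KEY (user_id)
);

-- "History" table: one row per level change.
CREATE TABLE user_level_history (
    id               INT UNSIGNED     NOT NULL AUTO_INCREMENT,
    user_id          INT UNSIGNED     NOT NULL,
    user_level       TINYINT UNSIGNED NOT NULL,
    level_start_date DATETIME         NOT NULL,
    PRIMARY KEY (id),
    KEY idx_history_user_date (user_id, level_start_date),
    FOREIGN KEY (user_id) REFERENCES users (user_id)
);

-- Changing a level: update the current row and record the change together.
START TRANSACTION;
UPDATE users SET user_level = 3 WHERE user_id = 42;
INSERT INTO user_level_history (user_id, user_level, level_start_date)
VALUES (42, 3, NOW());
COMMIT;

-- Current level: read it straight from users.
SELECT user_level FROM users WHERE user_id = 42;

-- Full history, in chronological order.
SELECT user_level, level_start_date
FROM user_level_history
WHERE user_id = 42
ORDER BY level_start_date;
```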

about database design

I need some idea about my database design. I have about 5 fields for basic information of user, such as name, email, gender etc.
Then I want to have about 5 fields for optional information such as messenger id's.
And 1 optional text field for info about user.
Should I create only one table with all the fields together, or should I create a separate table for the 5 optional fields in order to avoid redundancy etc.?
Thanks.
I'll stick with only one table.
Adding another table would only make things more complicated, and you would gain very little disk space.
And I really don't see how this can be redundant in any way ;)
I think that you should definitely stick with one table. Since all the information is relevant to a user and does not reflect any other logical model (like an article, blog post or such), you can safely keep everything in one place, even if some fields are optional.
I would create only one table for the additional fields, not with 5 fields but with a foreign key to the base table and key/value pairs. Something like:
create table users (
    user_id integer not null primary key,
    name varchar(200)
    -- the rest of the fields
);
create table users_additional_info (
    user_id integer not null,
    ai_type varchar(10) not null, -- type of additional info: messenger, extra email
    ai_value varchar(200) not null,
    foreign key (user_id) references users (user_id) -- table-level form; MySQL ignores inline REFERENCES
);
Eventually you might want an additional_info table to hold possible valid values for extra info: messenger, extra email, whatever. But that is up to you. I wouldn't bother.
It depends on how many people will be having all of that optional information and whether you plan on adding more fields. If you think you're going to add more fields in the future, it might be useful to move that information to a meta table using the EAV pattern : http://en.wikipedia.org/wiki/Entity-attribute-value_model
So, if you're unsure, your table would be like
User : id, name, email, gender, field1, field2
User_Meta : id, user_id, attribute, value
Using the user_id field in your meta table, you can link it to your user table and add as many sparsely used optional fields as you want.
Note: This pays off ONLY if you have many sparsely populated optional fields. Otherwise, keep them in one table.
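For a feel of what the EAV layout costs at query time, here is a minimal sketch. The attribute names 'skype' and 'icq', the column sizes, and the unique key are assumptions for illustration only.

```sql
CREATE TABLE user (
    id     INT UNSIGNED NOT NULL AUTO_INCREMENT,
    name   VARCHAR(200) NOT NULL,
    email  VARCHAR(255) NOT NULL,
    gender CHAR(1)      NULL,
    PRIMARY KEY (id)
);

CREATE TABLE user_meta (
    id        INT UNSIGNED NOT NULL AUTO_INCREMENT,
    user_id   INT UNSIGNED NOT NULL,
    attribute VARCHAR(50)  NOT NULL,
    value     VARCHAR(200) NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY uq_user_attribute (user_id, attribute),  -- one value per attribute per user
    FOREIGN KEY (user_id) REFERENCES user (id)
);

-- Turning the attribute rows back into columns takes a join plus
-- one conditional aggregate per attribute.
SELECT u.id,
       u.name,
       MAX(CASE WHEN m.attribute = 'skype' THEN m.value END) AS skype,
       MAX(CASE WHEN m.attribute = 'icq'   THEN m.value END) AS icq
FROM user AS u
LEFT JOIN user_meta AS m ON m.user_id = u.id
GROUP BY u.id, u.name;
```

With only five optional fields, plain nullable columns on the user table avoid this pivoting entirely, which is why the note above limits EAV to many sparsely populated fields.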
I would suggest using a single table for this. Databases are very good at optimizing away space for empty columns.
Splitting this table out into two or more tables is an example of vertical partitioning and in this case is likely to be a case of premature optimization. However, this technique can be useful when you have columns that you only need to query some of the time, eg. large binary blobs.

unnecessary normalization

My friend and I are building a website and having a major disagreement. The core of the site is a database of comments about 'people.' Basically, people can enter a comment and they can enter the person the comment is about. Then viewers can search the database for words that are in the comment or parts of the person's name. It is completely user generated. For example, if someone wants to post a comment on a misspelled version of a person's name, they can, and that's OK. So there may be multiple spellings of different people listed as several different entries (some with middle name, some with nickname, some misspelled, etc.), but this is all OK. We don't care if people make comments about random people or imaginary people.
Anyway, the issue is about how we are structuring the database. Right now it is just one table with the comment ID as the primary key, and then there is a field for the 'person' the comment is about:
comment ID - comment - person
1 - "he is weird" - John Smith
2 - "smelly girl" - Jenny
3 - "gay" - John Smith
4 - "owes me $20" - Jennyyyyyyyyy
Everything is working fine. Using the database, I am able to create pages that list all the 'comments' for a particular 'person.' However, he is obsessed that the database isn't normalized. I read up on normalization and learned that he was wrong. The table IS currently normalized, because the comment ID is unique and dictates the 'comment' and the 'person.' Now he is insistent that 'person' should have its OWN table because it is a 'thing.' I don't think it is necessary, because even though 'person' really is the bigger container (one 'person' can have many 'comments' about them), the database seems to operate just fine with 'person' being an attribute of the comment ID. I use various PHP calls for different SQL selections to make it magically appear more sophisticated on the output and in the different ways the user can search and see results, but in reality, the set-up is quite simple. I am now letting users rank comments with thumbs up and thumbs down, and I keep a 'score' as another field on the same table.
I feel that there is currently no need to have a separate table for just unique 'person' entries because the 'persons' don't have their own 'score' or any of their own attributes. Only the comments do. My friend is so insistent that it is necessary for efficiency. Finally I said, "OK, if you want me to create a separate table and let 'person' be its own field, then what would be the second field? Because if a table has just a single column, it seems pointless. I agree that we may later create a need to give 'person' its own table, but we can deal with that then." He then said that strings can't be primary keys, and that we would convert the 'persons' in the current table to numbers, and the numbers would be the primary key in the new 'person' table. To me this seems unnecessary and it would make the current table harder to read. He also thinks it will be impossible to create the second table later, and that we need to anticipate now that we might need it for something later.
Who is right?
In my opinion your friend is right.
Person should live in a different table and you should try to normalize. Don't overdo it, though.
In the long run you may want to do more things with your site. Say you want to attach multiple files to a person (e.g. pictures): you'll be very thankful then for the normalization.
Creating a new table for person and using the key of that table in place of the person attribute has nothing to do with normalization. It may be a good idea for other reasons but doing so does not make the database "more normalized" than not doing it. So you are right: as far as normalization is concerned, creating another table is unnecessary.
I would vote for your friend. I like to normalize and plan for the future and even if you never need it, this normalization is so easy to do it literally takes no time. You can create a view that you query in order to make your SQL cleaner and eliminate the need for you to join the tables yourself.
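A minimal sketch of the view idea mentioned above, assuming the data is split into a `persons` table and a `comments` table. All table, column, and view names here are hypothetical.

```sql
CREATE TABLE persons (
    person_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    name      VARCHAR(100) NOT NULL,
    PRIMARY KEY (person_id),
    UNIQUE KEY uq_persons_name (name)
);

CREATE TABLE comments (
    comment_id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    person_id    INT UNSIGNED NOT NULL,
    comment_text TEXT         NOT NULL,
    score        INT          NOT NULL DEFAULT 0,
    PRIMARY KEY (comment_id),
    KEY idx_comments_person (person_id),
    FOREIGN KEY (person_id) REFERENCES persons (person_id)
);

-- The view hides the join, so queries read like the original single table.
CREATE VIEW comments_with_person AS
SELECT c.comment_id, c.comment_text, p.name AS person, c.score
FROM comments AS c
JOIN persons AS p ON p.person_id = c.person_id;

SELECT comment_id, comment_text, score
FROM comments_with_person
WHERE person = 'John Smith';
```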
If you have already built everything you intend to and have no plans to expand, I think you can leave it as it is.
If you plan to add more, namely allowing people to have accounts, or anything really, I think it might be smart to separate your data into Person and Comments tables. It's not hard and makes expanding your functionality easier.
You're right.
Person may be a thing in general, but not in your model. If you were going to hassle people into properly identifying the person they're talking about, a Person table would be necessary. For example, if the comments were only about persons already registered in the database.
But here it looks like you have unstructured data, without identity, and nothing/nobody is interested in making sure whether "jenny" and "jennyyy" are in fact the same person, not to mention "jenny doe" and "my cousin"...
Well, there are two schools of thought. One says, create your data model in the most normalized way possible, then de-normalize if you need more efficiency. The other is basically "do the minimum work necessary for the job, then change it as your requirements change". Also known as YAGNI (You aren't going to need it).
It all depends on where you see this going. If this is all it will be, then your approach is probably fine. If you intend to improve it with new features over time, then your friend is right.
If you never intend to associate the person column with a user or anything else and data apparently needs no consistency or data integrity checks, just why is this in a relational database at all? Wouldn't this be a use case for a nosql database? Or am I missing something?
Normalization is all about functional dependencies (FDs). You need to identify all of the FDs that exist among the attributes of your data model before it can be fully normalized.
Let's review what you have:
Any given instance of a CommentId functionally determines the Person (FD: CommentId -> Person)
Any given instance of a CommentId functionally determines the Comment (FD: CommentId -> Comment)
Any given instance of a CommentId functionally determines the UserId (FD: CommentId -> UserId)
Any given instance of a CommentId functionally determines the Score (FD: CommentId -> Score)
Everything here is a dependent attribute on CommentId and CommentId alone. This might lead you to the belief that a relation (table) containing all of, or a subset of, the above attributes must be normalized.
First thing to ask yourself is why did you create the CommentId attribute anyway? Strictly speaking, this is a manufactured attribute - it does not relate to anything 'real'. CommentId is commonly referred to as a surrogate key. A surrogate key is just a made up value that stands in for a unique value set corresponding to some other group of attributes. So what group of attributes is CommentId a surrogate for? We can figure that out by asking the following questions and adding new FDs to the model:
1) Does a Comment have to be unique? If so, the FD: Comment -> CommentId must be true.
2) Can the same Comment be made multiple times as long as it is about a different Person? If so, then FD: Person + Comment -> CommentId must be true and the FD in 1 above is false.
3) Can the same Comment be made multiple times about the same Person provided it was made by different UserIds? If so, the FDs in 1 and 2 cannot be true but FD: Person + Comment + UserId -> CommentId may be true.
4) Can the same Comment be made multiple times about the same Person by the same UserId but have different Scores? This implies FD: Person + Comment + UserId + Score -> CommentId is true and the others are false.
Exactly one of the 4 FDs above must be true. Whichever it is affects how your data model is normalized.
Suppose FD: Person + Comment + UserId -> CommentId turns out to be true. The logical consequences are that:
Person + Comment + UserId and CommentId serve as equivalent keys with respect to Score.
Score should be put in a relation with one but not both of its keys (to avoid transitive dependencies). The obvious choice would be CommentId since it was specifically created as a surrogate.
A relation comprised of: CommentId, Person, Comment, UserId is needed to tie the Key to its surrogate.
From a theoretical point of view, the surrogate key CommentId is not required to make your data model or database work. However, its presence may affect how relations are constructed.
Creation of surrogate keys is a practical issue of some importance. Consider what might happen if you choose to not use a surrogate key but the full attribute set Person + Comment + UserId in its place, especially if it was required on multiple tables as a foreign or primary key:
Comment might add a lot of space overhead to your database because it is repeated in multiple tables. It is probably more than a couple of characters long.
What happens if someone chooses to edit a Comment? That change needs to be propagated to all tables where Comment is part of a key. Not a pretty sight!
Indexing long complex keys can take a lot of space and/or make for slow update performance.
The value assigned to a surrogate key never changes, no matter what you do to the values associated to the attributes that it determines. Updating the dependent attributes is now limited to the one table defining the surrogate key. This is of huge practical significance.
Now back to whether you should be creating a surrogate for Person. Does Person live on the left hand side of many, or any, FDs? If it does, its value will propagate through your database and there is a case for creating a surrogate for it. Whether Person is a text or numeric attribute is irrelevant to the choice of creating a surrogate key.
Based on what you have said, there is at best a weak argument to create a surrogate for Person. This argument is based on the suspicion that its value may at some point in the future become a key or part of a key.
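To illustrate the case where FD: Person + Comment + UserId -> CommentId holds, here is a rough sketch of the two relations described above: one tying the natural key to its surrogate, and one holding Score against the surrogate only. The table names, column names, and sizes are assumptions, and the comment is kept to VARCHAR(255) only so that it fits in a unique index.

```sql
-- Ties the natural key (person, comment_text, user_id) to its surrogate.
CREATE TABLE comments (
    comment_id   INT UNSIGNED NOT NULL AUTO_INCREMENT,   -- surrogate key
    person       VARCHAR(100) NOT NULL,
    comment_text VARCHAR(255) NOT NULL,
    user_id      INT UNSIGNED NOT NULL,
    PRIMARY KEY (comment_id),
    UNIQUE KEY uq_comments_natural_key (person, comment_text, user_id)
);

-- Score depends on the surrogate alone, not on the long natural key.
CREATE TABLE comment_scores (
    comment_id INT UNSIGNED NOT NULL,
    score      INT          NOT NULL DEFAULT 0,
    PRIMARY KEY (comment_id),
    FOREIGN KEY (comment_id) REFERENCES comments (comment_id)
);

-- Editing a comment touches exactly one row in one table;
-- comment_scores still points at the same comment_id.
UPDATE comments
SET comment_text = 'owes me $25'
WHERE comment_id = 4;
```

In a small application the score could just as well stay as a column on `comments`, since both keys determine it; the split is shown only to mirror the relations named above.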
Here's the deal. Whenever you create something, you want to make sure that it has room to grow. You want to try to anticipate future projects and future advancements for your program. In this scenario, you're right in saying that there is no need currently to add a persons table that just holds 1 field (not counting the ID, assuming you have an int ID field and a person name). However, in the future, you may want to have other attributes for such people, like first name, last name, email address, date added, etc.
While over-normalizing is certainly harmful, I personally would create another, larger table to hold the person with additional fields so that I can easily add new features in the future.
Whenever you're dealing with users, there should be a dedicated table. Then you can just join the tables and refer to that user's ID.
user -> id | username | password | email
comment -> id | user_id | content
SQL to join the comments to the users:
SELECT user.username, comment.content FROM user JOIN comment ON user.id = comment.user_id;
It'll make it so much easier in the future when you want to find information about that specific user. The amount of extra effort is negligible.
Concerning the "score" for each comment, that should be a separate table as well. That way you can connect a user to a "like" or "dislike."
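A minimal sketch of such a table, assuming one vote per user per comment stored as +1 or -1. The table and column names are hypothetical, and foreign keys to the user and comment tables are left out to keep the example short.

```sql
CREATE TABLE comment_votes (
    comment_id INT UNSIGNED NOT NULL,
    user_id    INT UNSIGNED NOT NULL,
    vote       TINYINT      NOT NULL,       -- +1 for a like, -1 for a dislike
    PRIMARY KEY (comment_id, user_id)       -- a user can vote once per comment
);

-- A user changing their mind overwrites their earlier vote.
INSERT INTO comment_votes (comment_id, user_id, vote)
VALUES (1, 7, 1)
ON DUPLICATE KEY UPDATE vote = VALUES(vote);

-- The score of each comment is derived from the votes.
SELECT comment_id, SUM(vote) AS score
FROM comment_votes
GROUP BY comment_id;
```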
With this database, you might feel that it is okay, but there may be a problem in the future when you want to give users more information from the database. Suppose you want to know the number of comments made about a person with name='abc'. In this case, you will have to go through the entire table of comments and keep counting. Instead, you can have an attribute called 'count' for every person and increment it whenever a comment is made about that person.
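Both approaches can be sketched briefly. The `comments` table below is a stand-in for the question's single-table layout, and `person_counts` is a hypothetical table for the stored counter. Note that an index on the person column lets MySQL count matching rows without reading the whole table, so the stored counter is an optimization rather than a necessity.

```sql
-- Stand-in for the existing single table (assumed shape).
CREATE TABLE comments (
    comment_id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    comment_text TEXT         NOT NULL,
    person       VARCHAR(100) NOT NULL,
    PRIMARY KEY (comment_id)
);

-- Approach 1: index the person column and count on demand.
CREATE INDEX idx_comments_person ON comments (person);

SELECT COUNT(*) AS comment_count
FROM comments
WHERE person = 'abc';

-- Approach 2: keep a stored counter per person, as described above.
CREATE TABLE person_counts (
    person        VARCHAR(100) NOT NULL,
    comment_count INT UNSIGNED NOT NULL DEFAULT 0,
    PRIMARY KEY (person)
);

-- Run this alongside every new comment about 'abc'.
INSERT INTO person_counts (person, comment_count)
VALUES ('abc', 1)
ON DUPLICATE KEY UPDATE comment_count = comment_count + 1;
```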
As far as normalization is concerned, it is always better to have a normalized database because it reduces redundancy and makes the database intuitive to understand. If you expect your database to grow large in the future, then it should be normalized.