I am having the worst nightmare deciding on a database schema! I recently signed up for my first freelance project.
It has user registration, and the requirements for the user table are fairly detailed:
- name
- password
- email
- phone
- is_active
- email_verified
- phone_verified
- is_admin
- is_worker
- is_verified
- has_payment
- last_login
- created_at
Now I am quite confused about whether to put everything in a single table or split things up, since I still need to add a few more fields, such as:
- token
- otp (maybe in the future)
- otp_limit (maybe in the future) // rate limiting
And maybe more in the future when there is an update. I am afraid that if a future update needs a new field, how do I add it if everything is in a single table?
And if I split things up, will that cause performance issues? Most of the fields are used moderately often across the web app.
How can I decide?
Your initial aim should be to create a model that is in Third Normal Form (3NF). Once you have that, if you then need to move away from a strict 3NF model in order to handle some specific operational requirement or challenge, that's fine - as long as you know what you're doing.
A working/simplified definition of whether a model is in 3NF is that all attributes that can be uniquely identified by the same key should be in the same table.
So all attributes of a user should be in the same table (as long as they have a 1:1 relationship with the User ID).
I'm not sure why adding new columns to a table in the future is worrying you - this should not affect a well-designed application. Obviously altering/dropping columns is a different matter.
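For illustration, a minimal sketch of such a single users table in MySQL - column names and types are taken from the question or assumed, not prescriptive:

```sql
-- One row per user; every attribute depends only on the key (3NF).
CREATE TABLE users (
    id              BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name            VARCHAR(100) NOT NULL,
    email           VARCHAR(255) NOT NULL UNIQUE,
    password_hash   VARCHAR(255) NOT NULL,  -- store a hash, never the raw password
    phone           VARCHAR(20),
    is_active       BOOLEAN NOT NULL DEFAULT TRUE,
    email_verified  BOOLEAN NOT NULL DEFAULT FALSE,
    phone_verified  BOOLEAN NOT NULL DEFAULT FALSE,
    is_admin        BOOLEAN NOT NULL DEFAULT FALSE,
    last_login      DATETIME,
    created_at      DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Adding a field in a future update is a routine, non-breaking change:
ALTER TABLE users ADD COLUMN otp_limit INT NOT NULL DEFAULT 0;
```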
As commented, design the database with your business or project use case and narrative in mind. Essentially, you need a relational model of Users, Portfolios, and Stocks, where Users can have many Portfolios and each Portfolio can contain many Stocks. If you need to track Registrations or Logins, add them to the schema so that Users can have multiple Registrations or Logins. That way, you simply add rows with the corresponding UserID, not columns.
Also, consider best practices:
Use Lookup Tables: For static (or rarely changing) data shared across related entities, incorporate lookup tables into the relational model, such as Tickers (with its ID referenced as a foreign key in Stocks). Anything that changes regularly at a specific level (e.g., the user level) should be stored in that level's table. Remember, database tables should not resemble spreadsheets with repeated static data stored in them.
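A hedged sketch of the Tickers/Stocks idea (all names illustrative):

```sql
-- Static ticker data lives exactly once in the lookup table...
CREATE TABLE tickers (
    ticker_id INT PRIMARY KEY,
    symbol    CHAR(5)     NOT NULL UNIQUE,
    exchange  VARCHAR(20) NOT NULL
);

-- ...and stocks carries only the foreign key, never repeated ticker data.
CREATE TABLE stocks (
    stock_id  INT PRIMARY KEY,
    ticker_id INT NOT NULL,
    shares    INT NOT NULL,
    FOREIGN KEY (ticker_id) REFERENCES tickers(ticker_id)
);
```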
Avoid Data Elements in Columns: Avoid wide-formatted tables where you store data elements in columns; tables with hundreds of suffixed or dated columns are indicative of this design. It keeps you from cleanly capturing login data and forces a redesign (an ALTER TABLE for a new column) with every new instance. Always normalize data for storage, efficiency, and scaling needs. The table below illustrates the anti-pattern, and a normalized sketch follows it.
| UserID | Login1 | Login2 | Login3 | ... |
|--------|--------|--------|--------|-----|
| 10001  | ...    | ...    | ...    | ... |
| 10002  | ...    | ...    | ...    | ... |
| 10003  | ...    | ...    | ...    | ... |
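A normalized sketch of the same data, assuming a users table with an id key (names illustrative):

```sql
-- Long form: one row per login event, no Login1..LoginN columns.
CREATE TABLE logins (
    login_id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    user_id  BIGINT UNSIGNED NOT NULL,
    login_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (user_id) REFERENCES users(id)
);

-- A new login is an INSERT, never an ALTER TABLE:
INSERT INTO logins (user_id) VALUES (10001);
```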
Application- vs. Data-Centric Design: Depending on your use case, try not to build the database with one specific application in mind, but as a generalized solution for all users - business personnel from CEOs to regular staff, and maybe even data scientists. Therefore, avoid short names, abbreviations (like otp), industry jargon, etc. Everything should be as clear and straightforward as possible.
Additionally, avoid any application or script that makes structural changes to the database, like creating temp tables or schemas on the fly. There is a debate over whether business logic should live in the database or in a specific application; usually, the work is shared between database and application. Keep in mind that MySQL is a powerful (though free), enterprise, server-level RDBMS, not a throwaway, file-level, small-scale system.
Maintain a Consistent Signature: Pick a naming convention and stick to it throughout the design (e.g., camelCase, snake_case, plurals). There is a big debate over whether you should prefix objects with tbl, vw, and sp. One strategy is to name data objects by their content and procedures/functions by their action. Always avoid reserved words, special characters, and spaces in names.
Always Document: While very tedious for developers, document every object, functionality, and extension, and annotate tables and fields with definitions. MySQL supports COMMENT clauses in CREATE statements for tables and fields, and you can use # or -- for comments in stored procedures or triggers.
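For example (table and wording are hypothetical), MySQL's COMMENT clause looks like this:

```sql
CREATE TABLE user_statuses (
    status_id   INT PRIMARY KEY,
    status_name VARCHAR(50) NOT NULL COMMENT 'Human-readable status label',
    description VARCHAR(255)         COMMENT 'When and why this status applies'
) COMMENT = 'Lookup table of valid user statuses';
```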
Once designed and in production, a database should rarely (if ever) be restructured. So carefully think through all the possibilities and scenarios of your use case beforehand. Do not dismiss the very important database design step. Good luck!
Related
I am currently working with a database designed by someone else from my team and he uses a database design style I have not encountered before. I was wondering if the following design would be good practice since it seems kind of cumbersome to me.
There is a 'normal' database with user- and business-specific information. For the 'types' in this database, a user for example, there exists a table in each of two separate databases, namely status and types.
The status for a user is simply a name and a description (for example active or deleted).
The type for users is not really clear to me, but the table consists of a name, a description, a subset and a level field.
The cumbersome part is the linking of these tables, since they exist in different databases and the user table requires keys for both status and types (not enforceable via foreign keys).
Wouldn't it be better to have a simple boolean field to indicate whether the user is active and for types, if there will ever be any which is not likely, use inheritance?
Such users may be beginners, so decide whether the user status is checked in code. For example, if your login query contains something like WHERE status = 'active', then you don't need a table: the statuses are static values you can keep in your source code. You should also consider the language of these statuses if your system supports multiple languages. The same goes for the type field.
But if you don't need to check these flags in your code, then there is no problem leaving them in a table; in that case it would be nice to give users the ability to add types or statuses later.
I think your question covers two different aspects:
1) data cohesion and integrity - referencing data from other tables, which may or may not be in the same database
2) normalization - do I need status and types in other tables, or can they be incorporated into the same table?
1) If you are not working with huge data (at least tens of millions of records), I would recommend replicating the status and types tables into the "normal" database. This is particularly recommended if the data there rarely changes.
Doing this allows you to apply referential constraints (FKs) and also gives you faster JOINs.
2) Although it adds some complexity (extra table, defining constraints etc.), having your data normalized may bring some important advantages:
flexibility - if a status or type is added, it just means a simple insert into a table
smaller tables - the users table stores only ids for status and type, not strings or hard-to-guess values (e.g. 0 - inactive, 1 - active, etc.)
easier maintenance - a type name is changed? Just update a record in a table
Normalized structures usually speak for themselves if designed properly (PKs, FKs, check constraints etc.) and allow separation of concerns (maybe you implement a designer for user types in some point in the future)
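A small sketch of what that can look like once the lookup tables live in the same database (names are illustrative):

```sql
CREATE TABLE user_statuses (
    status_id   INT PRIMARY KEY,
    name        VARCHAR(50) NOT NULL,
    description VARCHAR(255)
);

CREATE TABLE users (
    user_id   INT PRIMARY KEY,
    status_id INT NOT NULL,
    FOREIGN KEY (status_id) REFERENCES user_statuses(status_id)  -- now enforceable
);

-- "flexibility": a new status is just an insert
INSERT INTO user_statuses VALUES (3, 'suspended', 'temporarily blocked');
-- "easier maintenance": renaming a status touches one row
UPDATE user_statuses SET name = 'archived' WHERE status_id = 2;
```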
Usually, database separation should be done based on the activity type:
operational (lots of INSERTs, UPDATEs, DELETEs beside SELECTs)
reporting (mainly SELECTs)
ETL destination - heavy INSERTs, UPDATEs etc.
Below, I explain a basic design for a database I am working on. As I am not a DBA, I am concerned whether I am on a good track or a bad one, so I wanted to float this on Stack for some advice. I was not able to find a similar discussion that fits my design.
In my database, every table is considered an entity. An Entity could be a customer account, a person, a user, a set of employee information, contractor information, a truck, a plane, a product, a support ticket, etc etc. Here are my current entities (Tables)...
People
Users
Accounts
AccountUsers
Addresses
Employee Information
Contractor Information
And to store information about these Entities I have two tables:
Entity Tables
-EntityType
-> EntityTypeID (INT)
-Entities
-> EntityID (BIGINT)
-> EntityType (INT) : foreign key
Every table I have made has an auto-generated primary key, and a foreign key on an entityID column to the entities table.
In the entities table I have some shared fields like,
DateCreated
DateModified
User_Created
User_Modified
IsDeleted
CanUIDelete
I use triggers on all of the tables to automatically create their entity entry with the correct entity type on insert, and update triggers maintain the LastModified date.
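A sketch of what one such trigger might look like in MySQL, assuming Entities.EntityID is AUTO_INCREMENT and EntityType 1 means 'Person' - both assumptions, not the poster's actual code:

```sql
DELIMITER //
CREATE TRIGGER people_before_insert
BEFORE INSERT ON People
FOR EACH ROW
BEGIN
    -- create the Entities row first, then point the new People row at it
    INSERT INTO Entities (EntityType, DateCreated) VALUES (1, NOW());
    SET NEW.EntityID = LAST_INSERT_ID();
END//
DELIMITER ;
```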
From an application-layer point of view, all the code has to worry about is the individual entities (except for the User_Modified/User_Created fields, which it updates by joining on the entityID).
Now the reason for the entities table, is down the line I plan on having an EAV model, so every entity type can be extended with custom fields. It also serves as a decent place to store metadata about the entities (like the created/modified fields).
I'm just new to DB design, and want a 2nd opinion.
I plan on having an EAV model, so every entity type can be extended with custom fields.
Why? Do all your entities require to be extensible in this way? Probably not -- in most applications there are one or two entities at most that would benefit from this level of flexibility. The other entities actually benefit from the stability and clarity of not changing all the time.
EAV is an example of the Inner-Platform Effect:
The Inner-Platform Effect is the result of designing a system to be so customizable that it ends up becoming a poor replica of the platform it was designed with.
In other words, now it's your responsibility to write application code to do all the things that a proper RDBMS already provides, like constraints and data types. Even something as simple as making a column mandatory like NOT NULL doesn't work in EAV.
It's true sometimes a project requires a lot of tables. But you're fooling yourself if you think you have simplified the project by making just two tables. You will still have just as many distinct Entities as you would have had tables, but now it's up to you to keep them from turning into a pile of rubbish.
Before you invest too much time into EAV, read this story about a company that nearly ceased to function because someone tried to make their data repository arbitrarily flexible: Bad CaRMa.
I also wrote more about EAV in a blog post, EAV FAIL, and in a chapter of my book, SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming.
You haven't really given a design. If you had given a description of the tables, the application-oriented criterion for when a row goes into each of them, and the consequent constraints (keys, FKs, etc.) for the part of your application involving your entities, then you would have given part of a design - that part's straightforward relational design. (Just because you're not implementing it that way doesn't mean you don't need to design it properly.) Notice that this must include the application-level state and functionality for "extending with custom fields". But then you would also have to give a description of the tables, the criterion for when a row goes into each of them, and the consequent constraints (keys, FKs, etc.) for the part of your implementation that encodes the previous part via EAV, plus the operators for manipulating them - that part's straightforward relational design, i.e., the part of your design that is implementing a DBMS. Then you would really have given a design.
The notion that one needs to use EAV "so every entity type can be extended with custom fields" is mistaken. Just implement it via calls that sometimes update metadata tables instead of only updating regular tables: DDL instead of DML.
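To make the contrast concrete (table and column names here are hypothetical):

```sql
-- EAV route (DML): the "custom field" is only data; the DBMS cannot type it,
-- constrain it, or make it mandatory.
INSERT INTO EntityAttributeValues (EntityID, AttributeName, AttributeValue)
VALUES (10001, 'FavoriteColor', 'green');

-- DDL route: the custom field becomes a real, typed, constrainable column,
-- issued through a controlled "add custom field" routine in the application.
ALTER TABLE People ADD COLUMN FavoriteColor VARCHAR(30) NULL;
```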
This is a question that has probably been asked before, but I'm having some difficulty to find exactly my case, so I'll explain my situation in search for some feedback:
I have an application that will be registering locations. There are several types of locations, and each location type has a different set of attributes, but I need to associate notes with locations regardless of their type, and also associate other types of content (mostly multimedia entries and comments) with those notes. With this in mind, I came up with a few solutions:
Create a table for each location type, and a "notes" table for every location table with a foreign key. This is pretty troublesome because I would have to create a multimedia and a comments table for every notes table, e.g.:
LocationTypeA
ID
Attr1
Attr2
LocationTypeA_Notes
ID
Attr1
...
LocationTypeA_fk
LocationTypeA_Notes_Multimedia
ID
Attr1
...
LocationTypeA_Notes_fk
And so on, this would be quite annoying to do, but after it's done, developing on this structure should not be so troublesome.
Create a table with a unique identifier for the location and point content there, like so:
Location
ID
LocationTypeA
ID
Attr1
Attr2
Location_fk
Notes
ID
Attr1
...
Location_fk
Multimedia
ID
Attr1
...
Notes_fk
As you can see, this is far simpler and also easier to develop, but I just don't like the looks of that table with only IDs (yeah, that's truly the only objection I have to this; it's the option I like the most, to be honest).
Similar to option 2, but I would have an enormous table of attributes shaped like this:
Location
ID
Type
Attribute
Name
Value
And so on, or a table for each attribute, a la Drupal. This would be a pain to develop because it would take several insert/update operations to do anything with a location, and the Attribute table would be several times bigger than the Location table (or I would end up with an enormous number of attribute tables). It also has the same issue as the surrogate-keys-only table (except that it has a "type" now, which I would use to define the behavior of the location programmatically), but it's a pretty solution.
So, to the question: which would be the better solution performance- and scalability-wise? Which would you go with, or what alternatives would you propose? I don't have a problem implementing any of these; options 2 and 3 would be an interesting exercise, as I've never done anything like that, but I don't want to go with an option that will collapse on itself when the content grows a bit. You're probably thinking "why not just use Drupal if you know it works like you expect it to?", and I'm thinking "you obviously don't know how difficult it is to use Drupal; either that or you're an expert, which I'm most definitely not".
Also, now that I've written all of this: do you think option 2 is a good idea overall? Do you know of a better way to group entities / simulate inheritance? (Please don't say "just use inheritance!" - I'm restricted to MySQL.)
Thanks for your feedback, I'm sorry if I wrote too much and meant too little.
ORM systems usually use the following strategies, mostly the same solutions you listed:
One table per hierarchy
Pros:
Simple approach.
Easy to add new classes, you just need to add new columns for the additional data.
Supports polymorphism by simply changing the type of the row.
Data access is fast because the data is in one table.
Ad-hoc reporting is very easy because all of the data is found in one table.
Cons:
Coupling within the class hierarchy is increased because all classes are directly coupled to the same table.
A change in one class can affect the table which can then affect the other classes in the hierarchy.
Space potentially wasted in the database.
Indicating the type becomes complex when significant overlap between types exists.
Table can grow quickly for large hierarchies.
When to use:
This is a good strategy for simple and/or shallow class hierarchies where there is little or no overlap between the types within the hierarchy.
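A minimal sketch of this strategy in SQL (names illustrative, echoing the Person/Customer/Employee example used further down):

```sql
-- One table per hierarchy: a single table holds every type, with a
-- discriminator column and NULLable type-specific columns.
CREATE TABLE Persons (
    PersonID   INT PRIMARY KEY,
    PersonType VARCHAR(20)  NOT NULL,   -- 'Customer', 'Employee', 'Executive'
    Name       VARCHAR(100) NOT NULL,
    LoyaltyNo  VARCHAR(20)   NULL,      -- Customer-only; NULL for other types
    Salary     DECIMAL(10,2) NULL       -- Employee-only; NULL for other types
);

-- Polymorphism by simply changing the type of the row (e.g., hiring a customer):
UPDATE Persons SET PersonType = 'Employee', Salary = 50000 WHERE PersonID = 7;
```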
One table per concrete class
Pros:
Easy to do ad-hoc reporting as all the data you need about a single class is stored in only one table.
Good performance to access a single object’s data.
Cons:
When you modify a class you need to modify its table and the table of any of its subclasses. For example if you were to add height and weight to the Person class you would need to add columns to the Customer, Employee, and Executive tables.
Whenever an object changes its role, perhaps you hire one of your customers, you need to copy the data into the appropriate table and assign it a new POID value (or perhaps you could reuse the existing POID value).
It is difficult to support multiple roles and still maintain data integrity. For example, where would you store the name of someone who is both a customer and an employee?
When to use:
When changing types and/or overlap between types is rare.
One table per class
Pros:
Easy to understand because of the one-to-one mapping.
Supports polymorphism very well as you merely have records in the appropriate tables for each type.
Very easy to modify superclasses and add new subclasses as you merely need to modify/add one table.
Data size grows in direct proportion to growth in the number of objects.
Cons:
There are many tables in the database, one for every class (plus tables to maintain relationships).
Potentially takes longer to read and write data using this technique because you need to access multiple tables. This problem can be alleviated if you organize your database intelligently by putting each table within a class hierarchy on different physical disk-drive platters (this assumes that the disk-drive heads all operate independently).
Ad-hoc reporting on your database is difficult, unless you add views to simulate the desired tables.
When to use:
When there is significant overlap between types or when changing types is common.
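A minimal sketch of this strategy (names illustrative):

```sql
-- One table per class: a table for the superclass and one per subclass,
-- linked by a shared key.
CREATE TABLE Persons (
    PersonID INT PRIMARY KEY,
    Name     VARCHAR(100) NOT NULL
);

CREATE TABLE Employees (
    PersonID INT PRIMARY KEY,            -- same value as Persons.PersonID
    Salary   DECIMAL(10,2) NOT NULL,
    FOREIGN KEY (PersonID) REFERENCES Persons(PersonID)
);

-- Reading one Employee touches multiple tables (the main con above):
SELECT p.Name, e.Salary
FROM Persons p
JOIN Employees e ON e.PersonID = p.PersonID;
```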
Generic Schema
Pros:
Works very well when database access is encapsulated by a robust persistence framework.
It can be extended to provide meta data to support a wide range of mappings, including relationship mappings. In short, it is the start at a mapping meta data engine.
It is incredibly flexible, enabling you to quickly change the way that you store objects because you merely need to update the meta data stored in the Class, Inheritance, Attribute, and AttributeType tables accordingly.
Cons:
Very advanced technique that can be difficult to implement at first.
It only works for small amounts of data because you need to access many database rows to build a single object.
You will likely want to build a small administration application to maintain the meta data.
Reporting against this data can be very difficult due to the need to access several rows to obtain the data for a single object.
When to use:
For complex applications that work with small amounts of data, or for applications where data access isn't very common or you can pre-load data into caches.
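A rough sketch of a generic schema along the lines of the Class/Attribute/AttributeType tables named above (the Inheritance table is omitted for brevity; all names are assumptions):

```sql
CREATE TABLE Class (
    ClassID INT PRIMARY KEY,
    Name    VARCHAR(50) NOT NULL
);

CREATE TABLE AttributeType (
    AttributeTypeID INT PRIMARY KEY,
    Name            VARCHAR(30) NOT NULL   -- 'string', 'int', 'date', ...
);

CREATE TABLE Attribute (
    AttributeID     INT PRIMARY KEY,
    ClassID         INT NOT NULL,
    AttributeTypeID INT NOT NULL,
    Name            VARCHAR(50) NOT NULL,
    FOREIGN KEY (ClassID)         REFERENCES Class(ClassID),
    FOREIGN KEY (AttributeTypeID) REFERENCES AttributeType(AttributeTypeID)
);

CREATE TABLE Object (
    ObjectID INT PRIMARY KEY,
    ClassID  INT NOT NULL,
    FOREIGN KEY (ClassID) REFERENCES Class(ClassID)
);

CREATE TABLE AttributeValue (
    ObjectID    INT NOT NULL,
    AttributeID INT NOT NULL,
    Value       VARCHAR(255),              -- everything degrades to text here
    PRIMARY KEY (ObjectID, AttributeID),
    FOREIGN KEY (ObjectID)    REFERENCES Object(ObjectID),
    FOREIGN KEY (AttributeID) REFERENCES Attribute(AttributeID)
);
```

Note how building one object back requires reading many AttributeValue rows, which is exactly why this only suits small data volumes.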
I'm working on a legacy app - right now, we allow admins to generate forms with custom fields (they create the field, choose an input type, label, etc).
When the user fills out this custom form, all fields from that form are checked - if that field is not a column on the users table, we add it to the users table as a column.
For example, if an admin added a field called "flight arrival time", we would add a column called "flight_arrival_time" to the users table, and the User model would have an attribute called #user.flight_arrival_time.
What alternatives might there be to this current course of action? Is there a more efficient way of storing these values?
Here are some of the limitations:
We have tens of thousands of users, and I was told that storing these attributes in a different table and joining them would slow the system down A LOT. We often have around 20 or so admins querying, importing, updating, and generally using the hell out of our system, which is already pretty slow under load. And I wouldn't have the power to say "buy more {X} so we can be faster".
I assume a join table (called something like user_attributes) would store the user_id, the attribute name, and the attribute value. If each user has an additional 15 attributes, and we have 100,000 users, how much slower will it be?
The storage must be easily queryable (we use a dynamic search that allows users to choose any column from the User model as a search field and find an inputted value).
Would your option allow easy queries (for instance, find all users whose attribute named "Flight Arrival Time" is tomorrow)? Would it also become very slow?
I will experiment a bit, generate some of the proposed schema, generate 100,000 users and 20 attributes for each, and run some test queries to check execution times, but I'd like some input on where to start.
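As a starting point for that experiment, here is a hedged sketch of the long/EAV table and the kind of search it implies (names and storage format are assumptions):

```sql
CREATE TABLE user_attributes (
    user_id    INT NOT NULL,
    attr_name  VARCHAR(64) NOT NULL,
    attr_value VARCHAR(255),
    PRIMARY KEY (user_id, attr_name),
    KEY idx_attr (attr_name, attr_value(100))  -- makes attribute searches indexable
);

-- "All users whose Flight Arrival Time is tomorrow", assuming values are
-- stored as 'YYYY-MM-DD HH:MM:SS' strings so string order = date order
-- (itself a good illustration of EAV's typing weakness):
SELECT user_id
FROM user_attributes
WHERE attr_name = 'flight_arrival_time'
  AND attr_value >= DATE_FORMAT(CURDATE() + INTERVAL 1 DAY, '%Y-%m-%d 00:00:00')
  AND attr_value <  DATE_FORMAT(CURDATE() + INTERVAL 2 DAY, '%Y-%m-%d 00:00:00');
```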
Thanks for your input.
Not exactly an answer, but I think this kind of app would benefit a lot from a document-oriented / NoSQL system like MongoDB. Such systems are schema-less by design.
To add my two cents, letting users make dynamic changes to the schema seems a very dangerous option in an RDBMS environment to begin with. You could end up with tables with thousands of mostly empty columns, and Rails would instantiate objects with thousands of methods on them... and what happens when you delete a column?
In the long run, the approach you are following can make your database very slow.
Your database size will grow, since adding columns according to user behavior leads to null values in the other tuples.
It's better to use a document-oriented database, like MongoDB, CouchDB, Cassandra, etc.
There's a gem for that. https://github.com/Liooo/dynabute
Also, this question is a duplicate of:
Rails: dynamic columns/attributes on models?
I'm currently choosing between two different database designs: a complicated one, which separates the data better, and a simpler one. The complicated design will require more complex queries, while the simpler one will have a couple of null fields.
Consider the examples below:
Complicated: [schema diagram - per the description below, a users table with separate facebookusers and normalusers subtype tables]
Simpler: [schema diagram - a single users table with nullable facebookUserId, username, and password columns]
The above examples are for separating regular users and Facebook users (they will access the same data, eventually, but log in differently). In the first example, the data is clearly separated. The second example is way simpler, but it will have at least one null field per row: facebookUserId will be null if it's a normal user, while username and password will be null if it's a Facebook user.
My question is: which is preferred? Pros/cons? Which one is easier to maintain over time?
First, what Kirk said. It's a good summary of the likely consequences of each alternative design. Second, it's worth knowing what others have done with the same problem.
The case you outline is known in ER modeling circles as "ER specialization". ER specialization is just different wording for the concept of subclasses. The diagrams you present are two different ways of implementing subclasses in SQL tables. The first goes under the name "Class Table Inheritance". The second goes under the name "Single Table Inheritance".
If you do go with Class table inheritance, you will want to apply yet another technique, that goes under the name "shared primary key". In this technique, the id fields of facebookusers and normalusers will be copies of the id field from users. This has several advantages. It enforces the one-to-one nature of the relationship. It saves an extra foreign key in the subclass tables. It automatically provides the index needed to make the joins run faster. And it allows a simple easy join to put specialized data and generalized data together.
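A sketch of the shared-primary-key technique, using the table names from the question (column details assumed):

```sql
CREATE TABLE users (
    id INT PRIMARY KEY
    -- columns common to all users ...
);

CREATE TABLE facebookusers (
    id             INT PRIMARY KEY,     -- a copy of users.id
    facebookUserId BIGINT NOT NULL,
    FOREIGN KEY (id) REFERENCES users(id)
);

CREATE TABLE normalusers (
    id       INT PRIMARY KEY,           -- a copy of users.id
    username VARCHAR(50)  NOT NULL,
    password VARCHAR(255) NOT NULL,
    FOREIGN KEY (id) REFERENCES users(id)
);

-- The "simple easy join" that puts specialized and generalized data together:
SELECT u.*, f.facebookUserId
FROM users u
JOIN facebookusers f ON f.id = u.id;
```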
You can look up "ER specialization", "single-table-inheritance", "class-table-inheritance", and "shared-primary-key" as tags here in SO. Or you can search for the same topics out on the web. The first thing you will learn is what Kirk has summarized so well. Beyond that, you'll learn how to use each of the techniques.
Great question.
This applies to any abstraction you might choose to implement, whether in code or database. Would you write a separate class for the Facebook user and the 'normal' user, or would you handle the two cases in a single class?
The first option is the more complicated. Why is it complicated? Because it's more extensible. You could easily include additional authentication methods (a table for Twitter IDs, for example), or extend the Facebook table to include... some other facebook specific information. You have extracted the information specific to each authentication method into its own table, allowing each to stand alone. This is great!
The trade off is that it will take more effort to query, it will take more effort to select and insert, and it's likely to be messier. You don't want a dozen tables for a dozen different authentication methods. And you don't really want two tables for two authentication methods unless you're getting some benefit from it. Are you going to need this flexibility? Authentication methods are all similar - they'll have a username and password. This abstraction lets you store more method-specific information, but does that information exist?
The second option is just the reverse of the first. Easier, but how will you handle future authentication methods, and what if you need to add some authentication-method-specific information?
Personally I'd try to evaluate how important this authentication component is to the system. Remember YAGNI - you aren't gonna need it - and don't overdesign. Unless you need that extensibility that the first option provides, go with the second. You can always extract it at a later date if necessary.
This depends on the database you are using. For example, Postgres has table inheritance, which would be great for your example; have a look here:
http://www.postgresql.org/docs/9.1/static/tutorial-inheritance.html
Now, if you do not have table inheritance, you can still create views to simplify your queries, so the "complicated" example is a viable choice here.
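For instance, a view over the two subtype tables, assuming the shared-primary-key layout sketched in the previous answer:

```sql
CREATE VIEW all_users AS
SELECT u.id, 'facebook' AS login_type, f.facebookUserId, NULL AS username
FROM users u JOIN facebookusers f ON f.id = u.id
UNION ALL
SELECT u.id, 'normal'   AS login_type, NULL,             n.username
FROM users u JOIN normalusers n ON n.id = u.id;
```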
If you had infinite time, I would go for the first one (for this one simple example, and preferably with table inheritance).
However, it makes things more complicated and so will cost you more time to implement and maintain. If you have many table hierarchies like this, it can also have a performance impact (as you have to join many tables). I once developed a database schema that made excessive use of such hierarchies (conceptually). We finally decided to keep the hierarchies conceptually but flatten them in the implementation, as the design had gotten so complex that it was no longer maintainable.
When you flatten the hierarchy, you might consider not using null values, as these can also make things a lot harder (alternatively, you can use a -1 or something similar).
Hope these thoughts help you!
Warning bells are ringing loudly at the presence of the two very similar tables facebookusers and normalusers. What if you get a 3rd type? Or a 10th? This is insane.
There should be one user table with an attribute column to show the type of user. A user is a user.
Keep the data model as simple as you possibly can. Don't build too much kung fu into it via the data structure; leave that to the application, which is far easier to alter than a database!
Let me dare to suggest a third option. You could introduce one (or two) tables that cater for extensibility. I personally try to avoid designs that introduce (read: pollute) an entity model with non-uniformly applicable columns. Have the third table (after the fashion of the EAV model) hold a many-to-one relationship with your users table to cater for multiple/variable user-related fields.
I'm not sure what your current/short-term needs are, but re-engineering your app later to cater for, say, Twitter or LinkedIn users might be painful. You can abstract the content of the facebookUserId column into an attribute table like so:
user_attr{
id PK
user_id FK
login_id
}
Now, the above definition is ambiguous enough to handle your current needs. If done right, the EAV should look more like this :
user_attr{
id PK
user_id FK
login_id
login_id_type FK
login_id_status //simple boolean flag to set the validity of a given login
}
Here login_id_type is a foreign key to an attribute table listing the various login types you currently support. This gives you and your users flexibility: users can have multiple logins using different external services without you having to change much of your existing system.
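A hedged sketch of that lookup table and a typical query against it (names beyond those in the answer are assumptions):

```sql
CREATE TABLE login_types (
    id   INT PRIMARY KEY,
    name VARCHAR(30) NOT NULL   -- 'facebook', 'twitter', 'linkedin', ...
);

-- Supporting a new external service is one row, not a schema change:
INSERT INTO login_types (id, name) VALUES (3, 'linkedin');

-- All valid Facebook logins for a given user:
SELECT ua.login_id
FROM user_attr ua
JOIN login_types lt ON lt.id = ua.login_id_type
WHERE ua.user_id = 42
  AND lt.name = 'facebook'
  AND ua.login_id_status = 1;
```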