First off, I am new to database design, so apologies for any incorrect terminology.
For a university assignment I have been tasked with creating the database schema for a website. In part of the website, a user selects their availability for hosting an event, and the event can be at any time: for example, from 12/12/2015 - 15/12/2015 and 16/01/2016 - 22/12/2016, as well as singular dates such as 05/01/2016. They also have the option of having the event available all the time.
I am unsure how to store all these kinds of values in a database table without using a lot of rows. The basic example below stores each date of availability, but that is a lot of records, and that is just for one event. Is there a better method of storing these values, or would this be stored elsewhere, outside of a database?
calendar_id | event_id | available_date
---------------------------------------
492 | 602 | 12/12/2015
493 | 602 | 13/12/2015
494 | 602 | 14/12/2015
495 | 602 | 15/12/2015
496 | 602 | 05/01/2016
497 | 602 | 16/01/2016
498 | 602 | 17/01/2016
etc...
This definitely requires a database. I don't think you should be concerned about the number of records in a database... that is what databases do best. However, from a university perspective there is something called Normalization. In simple terms normalization is about minimizing data repetition.
Steps to design a schema
Identify entities
As the first step of designing a database schema I tend to identify all the entities in the system. Looking at your example I see (1) Events and (2) EventTimes (event occurrences/bookings) with a one-to-many relation since one Event might have multiple EventTimes. I would suggest that you keep these two entities separate in the database. That way an Event can be extended with more attributes/fields without affecting its EventTimes. Most importantly you can add many EventTimes on an Event without repeating all the event's fields (which would be the case if you use a single table).
Identify attributes
The second step for me is to identify all the attributes/fields of each entity. Additionally, I always suggest an auto-increment id in every table to uniquely identify a row.
Identify constraints
This might be a bit more advanced, but most of the time you have constraints on what are acceptable data values or on what uniquely identifies a row in real life. For example, the Event.id might identify the row in the database, but you might also require that each event has a unique title.
Example schema
This has to be adjusted to the assignment or, in a real application, to the system's requirements
Events table
id int auto-increment
title varchar unique: Event's title
always_on boolean/enum: If 'Y' then the event is on all the time
... more fields here ... (category, tags, notes, description, venue,...)
EventTimes
id int auto-increment
event_id foreign key pointing to Event.id
start_datetime datetime or int (int if you go for a unix timestamp)
end_datetime : as above
... more fields again... (recurrence below is a hard one! avoid it if you can)
recurrence enum/int : Is the event repeated? Weekly, monthly, etc.
recurrence_interval int: Every x days, months, years, etc.
A note on dates/times: as a rule of thumb, whenever you deal with dates and times in a database, always store them in UTC. You probably don't want/need to mess with timezones in an assignment... but keep it in mind.
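The schema above can be sketched as follows, using Python's sqlite3 as a stand-in for MySQL (table names follow the example; the event title, sample dates, and the `available_on` helper are illustrative assumptions). Note that each availability *range* is one row, instead of one row per day:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Events (
        id        INTEGER PRIMARY KEY AUTOINCREMENT,
        title     TEXT UNIQUE NOT NULL,
        always_on INTEGER NOT NULL DEFAULT 0   -- boolean flag
    );
    CREATE TABLE EventTimes (
        id             INTEGER PRIMARY KEY AUTOINCREMENT,
        event_id       INTEGER NOT NULL REFERENCES Events(id),
        start_datetime TEXT NOT NULL,          -- stored as UTC, ISO format
        end_datetime   TEXT NOT NULL
    );
""")

conn.execute("INSERT INTO Events (title) VALUES ('Winter fair')")
event_id = conn.execute(
    "SELECT id FROM Events WHERE title = 'Winter fair'").fetchone()[0]

# One row per availability window, not one row per day.
conn.executemany(
    "INSERT INTO EventTimes (event_id, start_datetime, end_datetime) VALUES (?, ?, ?)",
    [(event_id, "2015-12-12", "2015-12-15"),
     (event_id, "2016-01-05", "2016-01-05"),   # single date: start == end
     (event_id, "2016-01-16", "2016-01-22")])

def available_on(day):
    # Is the event available on a given date?
    count = conn.execute("""
        SELECT COUNT(*) FROM EventTimes
        WHERE event_id = ? AND ? BETWEEN start_datetime AND end_datetime
    """, (event_id, day)).fetchone()[0]
    return count > 0

print(available_on("2015-12-13"))  # True
print(available_on("2015-12-20"))  # False
```

With ISO-formatted UTC strings, BETWEEN compares correctly, and a whole range of dates costs one row.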
Possible extensions to the example
Designing a complete system, one might add tables such as Venues, Organizers, Locations, etc... this can go on forever! I do try to think of future requirements when designing, but do not overdo it, because you end up with a lot of fields that you don't use and increased complexity.
Conclusion
Normalization is something you have to keep in mind when designing a database; however, you can see that the more you normalize your schema, the more complex your selects and joins become. There is a trade-off between data efficiency and query efficiency... That is the reason I said "from a university perspective" earlier. In a real-life system with complex data structures (for example, graphs!) you might need to denormalize the tables to make your queries more efficient/faster or easier. There are other approaches to such issues (functions in the database, temporary/staging tables, views, etc.), but it always depends on the specific case.
Another really useful thing to keep in mind: requirements always change! Design your databases taking it for granted that fields will be added/removed, more tables will be added, new constraints will appear, etc., and thus make them as extensible and easy to modify as possible... (now we are scratching at "Agile" methodologies)
I hope this helps and does not confuse things more. I am not a DBA per se, but I have designed a few schemas. All the above comes from experience rather than a book, and it may not be 100% accurate. It is definitely not the only way to design a database... this job is kind of an art :)
Related
I am having the greatest nightmare deciding on a database schema! I recently signed up for my first freelance project.
It has user registration, and there are fairly detailed requirements for the user table, as follows:
- name
- password
- email
- phone
- is_active
- email_verified
- phone_verified
- is_admin
- is_worker
- is_verified
- has_payment
- last_login
- created_at
Now I am quite confused about whether to put everything in a single table or split things up, as I still need to add a few more fields like
- token
- otp ( may be in future )
- otp_limit ( may be in future ) // rate limiting
And there may be more in the future when there is an update. I am afraid that, if a future update requires a new field, how do I add it if everything is in a single table?
And if I split things up, will that cause performance issues? Most of the fields are used moderately often in the webapp.
How can I decide?
Your initial aim should be to create a model that is in 3rd Normal Form (3NF). Once you have that, if you then need to move away from a strict 3NF model in order to handle some specific operational requirements/challenges effectively, then that's fine - as long as you know what you're doing.
A working/simplified definition of whether a model is in 3NF is that all attributes that can be uniquely identified by the same key should be in the same table.
So all attributes of a user should be in the same table (as long as they have a 1:1 relationship with the User ID).
I'm not sure why adding new columns to a table in the future is worrying you - this should not affect a well-designed application. Obviously altering/dropping columns is a different matter.
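To illustrate why adding columns later is not scary, here is a minimal sketch using Python's sqlite3 as a stand-in for MySQL (table and column names are illustrative assumptions): the new column arrives via a single ALTER TABLE, and existing rows and queries keep working.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        id            INTEGER PRIMARY KEY AUTOINCREMENT,
        name          TEXT NOT NULL,
        email         TEXT UNIQUE NOT NULL,
        password_hash TEXT NOT NULL,
        is_active     INTEGER NOT NULL DEFAULT 1,
        created_at    TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")
conn.execute(
    "INSERT INTO users (name, email, password_hash) VALUES ('alice', 'a@example.com', 'x')")

# A future requirement arrives (e.g. OTP rate limiting): add the column.
# Existing rows simply pick up the default value.
conn.execute("ALTER TABLE users ADD COLUMN otp_limit INTEGER DEFAULT 5")

row = conn.execute(
    "SELECT name, otp_limit FROM users WHERE email = 'a@example.com'").fetchone()
print(row)  # ('alice', 5)
```

Old `SELECT name FROM users ...` queries are unaffected; only code that needs the new column has to know it exists.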
As commented, design the database with your business or project use case and narrative in mind. Essentially, you need a relational model of Users, Portfolios, and Stocks, where Users can have many Portfolios and each Portfolio can contain many Stocks. If you need to track Registrations or Logins, add those to the schema so that Users can have multiple Registrations or Logins. That way, you simply add rows with the corresponding UserID, not columns.
Also, consider best practices:
Use Lookup Tables: For static (or rarely changed) data shared across related entities, incorporate lookup tables into the relational model, like Tickers (with its ID referenced as a foreign key in Stocks). Anything that regularly changes at a specific level (i.e., user level) should be stored in that level's table. Remember, database tables should not resemble spreadsheets with repeated static data stored within them.
Avoid Data Elements in Columns: Avoid wide-formatted tables that store data elements in columns. Tables with hundreds of suffixed or dated columns are indicative of this design. It fails to capture Logins data cleanly and forces a re-design (an ALTER for a new column) with every new instance. Always normalize data for storage, efficiency, and scaling needs.
UserID | Login1 | Login2 | Login3 | ...
---------------------------------------
10001  | ...    | ...    | ...    | ...
10002  | ...    | ...    | ...    | ...
10003  | ...    | ...    | ...    | ...
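The normalized (long) layout replaces the Login1..LoginN columns above with one row per login event. A minimal sketch using Python's sqlite3 as a stand-in for MySQL (table and column names are illustrative assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Users (
        UserID INTEGER PRIMARY KEY,
        Name   TEXT NOT NULL
    );
    CREATE TABLE Logins (
        LoginID INTEGER PRIMARY KEY AUTOINCREMENT,
        UserID  INTEGER NOT NULL REFERENCES Users(UserID),
        LoginAt TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO Users VALUES (10001, 'alice')")

# Each new login is an INSERT, never an ALTER TABLE.
for ts in ("2024-01-01 09:00", "2024-01-02 09:05", "2024-01-03 08:55"):
    conn.execute("INSERT INTO Logins (UserID, LoginAt) VALUES (10001, ?)", (ts,))

count = conn.execute(
    "SELECT COUNT(*) FROM Logins WHERE UserID = 10001").fetchone()[0]
print(count)  # 3
```

A user's full login history is then a plain WHERE clause, and the schema never changes as activity grows.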
Application vs Data Centric Design: Depending on your use case, try not to build the database with one specific application in mind, but as a generalized solution for all users, from business personnel and CEOs to regular staff, and maybe even data scientists. Therefore, avoid short names, abbreviations (like otp), industry jargon, etc. Everything should be as clear and straightforward as possible.
Additionally, avoid any application or script that makes structural changes to the database, like creating temp tables or schemas on the fly. There is a debate over whether business logic should live in the database or in the application; usually, data handling is shared between the two. Keep in mind, MySQL is a powerful (though free) enterprise, server-level RDBMS, not a throwaway file-level, small-scale system.
Maintain a Consistent Signature: Pick a naming convention and stick to it throughout the design (e.g., camelCase, snake_case, plurals). There is a big debate over whether you should prefix objects with tbl, vw, and sp. One strategy is to name data objects by their content and procedures/functions by their action. Always avoid reserved words, special characters, and spaces in names.
Always Document: While very tedious for developers, document every object, functionality, and extension, and annotate tables and fields with definitions. MySQL supports COMMENT clauses in CREATE statements for tables and fields. Use # or -- for comments in stored procedures and triggers.
Once designed and in production, databases should rarely (if ever) be restructured. So carefully think through all possibilities and scenarios beforehand with your use case. Do not dismiss the very important database design step. Good luck!
I'm familiar with normalized databases and I'm able to produce all kinds of queries. But since I'm starting on a green-field project now, one question has kept me busy this week:
It's the typical "webshop problem", I'd say (even if I'm not building a webshop): how to model the product information?
There are some approaches, each with its own advantages or disadvantages:
One Table to rule them all
Putting every "product" into a single table, generating every column possible, and working with this monster table.
Pro:
Easy queries
Easy layout
Con:
Lots of NULL values
The application code becomes sensitive to the query (different product types require different columns)
EAV-Pattern
Obviously the EAV pattern can provide a nicer solution for this. However, I've worked with EAV in the past, and when it comes down to performance, it can become a problem for a large number of entries.
Searching is easy, but listing a "normalized table" requires one join per actual column, which is slow.
Pro:
Clean
Flexible
Con:
Performance
Not Normalized
Single Table per category
Basically the opposite of the EAV pattern: create one table per product type, e.g. "cats", "dogs", "cars", ...
While this might be workable for a small number of categories, it becomes a nightmare to maintain for a steadily growing number of categories.
Pro:
Clean
Performance
Con:
Maintenance
Query-Management
Best of both worlds
So, on my journey through the internet I found recommendations to mix both approaches: use a single table for the common information, while grouping other attributes into "attribute groups" organized in the EAV fashion.
However, I think this would basically import the drawbacks of each approach... You need to work with regular tables (basic information) and do a huge number of joins to get all the information.
Storing enhanced information in JSON/XML
Another approach is to store extended information as JSON/XML entries (within a column of the "root table").
However, I don't really like this, as it seems harder to query and work with than a regular database layout.
Automating single tables
Another idea was automating the "create table per category" part (and therefore automating the queries on those tables), while maintaining a "master table" containing just the id and the category information, in order to get the best performance for an undetermined number of tables...?
i.e.:
Products
id | category | actualId
1 | cat | 1
2 | car | 1
cats
id | color | mew
1 | white | true
cars
id | wheels | bhp
1 | 4 | 123
The (abstract) Products table would allow querying for everything, while details are available via an easy join on "actualId" and the responsible table.
However, this leads to problems if you want to run a "show all" query, because this is not solvable by SQL alone: the table name in the join needs to be explicit in the query.
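To make that limitation concrete, here is a sketch of the "show all" workaround for this layout, using Python's sqlite3 and the tables from the example (the UNION ALL query is illustrative): every table name must be written out by hand, which is exactly the maintenance burden in question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Products (id INTEGER PRIMARY KEY, category TEXT, actualId INTEGER);
    CREATE TABLE cats (id INTEGER PRIMARY KEY, color TEXT, mew INTEGER);
    CREATE TABLE cars (id INTEGER PRIMARY KEY, wheels INTEGER, bhp INTEGER);
    INSERT INTO Products VALUES (1, 'cat', 1), (2, 'car', 1);
    INSERT INTO cats VALUES (1, 'white', 1);
    INSERT INTO cars VALUES (1, 4, 123);
""")

# Every new category table means editing this query by hand
# (or generating it from the category list at run time).
rows = conn.execute("""
    SELECT p.id, p.category, c.color AS detail
    FROM Products p JOIN cats c ON p.actualId = c.id
    WHERE p.category = 'cat'
    UNION ALL
    SELECT p.id, p.category, CAST(r.wheels AS TEXT)
    FROM Products p JOIN cars r ON p.actualId = r.id
    WHERE p.category = 'car'
""").fetchall()
print(rows)  # e.g. [(1, 'cat', 'white'), (2, 'car', '4')]
```

The per-category detail columns also have to be coerced to a common type, another symptom of fighting the layout.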
What other options are available? There are a lot of "webshops", each dealing with this problem more or less - how do they solve it in an efficient way?
I strongly disagree with your opinion that the "monster" table approach leads to "easy queries", and that the EAV approach will cause performance issues (premature optimization?). It also doesn't have to require complex queries:
SELECT base.id, base.other_attributes,
       GROUP_CONCAT(CONCAT(ext.key, '[', ext.type, ']', ext.value))
FROM base_attributes base
LEFT JOIN extended_attributes ext
  ON base.id = ext.id
WHERE base.id = ?
GROUP BY base.id, base.other_attributes
;
You would need to do some parsing on the result, but a wee bit of polishing would give you something parseable as JSON or XML, without putting your data inside anonymous blobs.
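A runnable sketch of that EAV shape, using Python's sqlite3 (SQLite's group_concat and the || concatenation operator stand in for MySQL's GROUP_CONCAT/CONCAT; table names follow the query above, sample data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE base_attributes (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE extended_attributes (
        id    INTEGER NOT NULL REFERENCES base_attributes(id),
        key   TEXT NOT NULL,
        type  TEXT NOT NULL,
        value TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO base_attributes VALUES (1, 'BMX 200 Ultra')")
conn.executemany("INSERT INTO extended_attributes VALUES (?, ?, ?, ?)",
                 [(1, 'frame_colour', 'str', 'blue'),
                  (1, 'wheels',       'int', '2')])

# Fold the key/value rows into one delimited column per base row.
row = conn.execute("""
    SELECT base.id, base.name,
           group_concat(ext.key || '[' || ext.type || ']=' || ext.value, ';')
    FROM base_attributes base
    LEFT JOIN extended_attributes ext ON base.id = ext.id
    WHERE base.id = ?
    GROUP BY base.id, base.name
""", (1,)).fetchone()
print(row)  # pair order within the string is not guaranteed
```

One query per entity, one parse step in the application, and no per-column joins for the listing case.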
If you don't care about data integrity and are happy to solve performance via replication, then NoSQL is the way to go (this is really the same thing as using JSON or XML to store your data).
I'm developing a database. I'd appreciate some help restructuring 2 to 3 tables so that the database is both compliant with the first three normal forms and practical to use and expand in the future. I want to invest time now to reduce effort and confusion later.
PREAMBLE
Please be aware that I'm both a newbie and an amateur, though I have a certain amount of experience and skill, and an abundance of enthusiasm!
BACKGROUND TO PROJECT
I am writing a small (though ambitious!) web application (using PHP and AJAX to a MySQL database). It is essentially an inventory management system, for recording and viewing the current location of each individual piece of equipment, and its maintenance history. If relevant, transactions will be very low (probably less than 100 a day, but with a possibility of simultaneous connections / operations). Row count will also be very low (maybe a few thousand).
It will deal with many completely different categories of equipment, e.g. bikes and lamps (to take random examples). Each unit of equipment will have its details or specifications recorded in the database. For a bike, an important specification might be frame colour, whereas a lamp might require information regarding lampshade material.
Since the categories of equipment have so little in common, I think the most logical way to store the information is 1 table per category. That way, each category can have columns specific to that category.
I intend to store a list of categories in a separate table. Each category will have an id which is unique to that category. (Depending on the final design, this may function as a lookup table and/or as a table to run queries against.) There are likely to be very few categories (perhaps 10 to 20), unless the system is particularly successful and expands.
A list of bikes will be held in the bikes table.
Each bike will have an id which is unique to that bike (eg bike 0001).
But the same id will exist in the lamp table (ie lamp 0001).
With my application, I want the user to select (from a dropdown list) the category type (eg bike).
They will then enter the object's numeric id (eg 0001).
The combination of these two ids is sufficient information to uniquely identify an object.
Images:
Current Table Design
Proposed Additional Table
PROBLEM
My gut feeling is that there should be an "overarching table" that encompasses every single article of equipment, no matter what category it comes from. This would be far simpler to query against than god knows how many mini tables. But when I try to construct it, it seems like it will break various normal forms, e.g. introducing redundancy, the possibility of inconsistency, referential integrity problems, etc. It also begins to look like a domain table.
Perhaps the overarching table should be a query or view rather than an entity?
Could you please have a look at the screenshots and let me know your opinion. Thanks.
For various reasons, I’d prefer to use surrogate keys rather than natural keys if possible. Ideally, I’d prefer to have that surrogate key in a single column.
Currently, the bike (or lamp) table uses just the first column as its primary key. Should I expand this to a composite key that also includes the Equipment_Category_ID column? Then make the Equipment_Article table a view joining on these two columns (iteratively for each equipment category). Optionally, the Bike_ID and Lamp_ID columns could be renamed to something generic like Equipment_Article_ID. This might make the query simpler, but is there a risk of losing specificity? It would/could still be qualified by the table name.
Speaking of redundancy, the Equipment_Category_ID in the current lamp or bike tables seems a bit redundant (if every item / row in that table has the same value in that column).
It all still sounds messy! But surely this must be a very common problem for, e.g., online electronics stores, rental shops, etc. Hopefully someone will say, "oh, that old chestnut!" Fingers crossed! Sorry for not being concise, but I couldn't work out which bits to leave out. Most of it seems relevant, if a bit chatty. Thanks in advance.
UPDATE 27/03/2014 (Reply to #ElliotSchmelliot)
Hi Elliot.
Thanks for your reply and for pointing me in the right direction. I studied OOP (in Java) but wasn't aware that something similar was possible in SQL. I read the link you sent with interest, and the rest of the site/book looks like a great resource.
Does MySQL InnoDB Support Specialization & Generalization?
Unfortunately, after 3 hours searching and reading, I still can't find the answer to this question. Keywords I'm searching with include: MySQL + (inheritance | EER | specialization | generalization | parent | child | class | subclass). The only positive result I found is here: http://en.wikipedia.org/wiki/Enhanced_entity%E2%80%93relationship_model. It mentions MySQL Workbench.
Possible Redundancy of Equipment_Category (Table 3)
Yes and no. Because this is a lookup table, it currently has a function. However, because every item in the Lamp or Bike table is of the same category, the column itself may be redundant; and if it is, then the Equipment_Category table may be redundant... unless it is required elsewhere. I had intended to use it as the RowSource/OptionList for a webform dropdown. Would it not also be handy to have Equipment_Category as a column in the proposed Equipment parent table? Without it, how would one return a list of all Equipment_Names for the Lamp category (ignoring DISTINCT for the moment)?
Implementation
I have no way of knowing what new categories of equipment may need to be added in future, so I'll have to limit the attributes included in the superclass/parent to those I am 100% sure are common to all (or allow nulls, I suppose); sacrificing duplication in many child tables for increased flexibility and, hopefully, simpler maintenance in the long run. This is particularly important as we will not have professional IT support for this project.
Changes really do have to be automated. So I like the idea of the stored procedure. And the CreateBike example sounds familiar (in principle if not in syntax) to creating an instance of a class in Java.
Lots to think about and teach myself! If you have any other comments or suggestions, they'd be most welcome. Also, could you let me know what software you used to create your UML diagram? Its styling is much better than the tools I've used.
Cheers!
You sound very interested in this project, which is always awesome to see!
I have a few suggestions for your database schema:
You have individual tables for each Equipment entity i.e. Bike or Lamp. Yet you also have an Equipment_Category table, purely for identifying a row in the Bike table as a Bike or a row in the Lamp table as a Lamp. This seems a bit redundant. I would assume that each row of data in the Bike table represents a Bike, so why even bother with the category table?
You mentioned that your "gut" feeling is telling you to go for an overarching table for all Equipment. Are you familiar with the practice of generalization and specialization in database design? What you are looking for here is specialization (also called "top-down".) I think it would be a great idea to have an overarching or "parent" table that represents Equipment. Then, each sub-entity such as Bike or Lamp would be a child table of Equipment. A parent table only has the fields that all child tables share.
With these suggestions in mind, here is how I might alter your schema:
In the above schema, everything starts as Equipment. However, each Equipment can be specialized into Lamp, Bike, etc. The Equipment entity has all of the common fields. Lamp and Bike each have fields specific to their own type. When creating an entity, you first create the Equipment, then you create the specialized entity. For example, say we are adding the "BMX 200 Ultra" bike. We first create a record in the Equipment table with the generic information (equipmentName, dateOfPurchase, etc.) Then we create the specialized record, in this case a Bike record with any additional bike-specific fields (wheelType, frameColor, etc.) When creating the specialized entities, we need to make sure to link them back to the parent. This is why both the Lamp and Bike entities have a foreign key for equipmentID.
An easy and effective way to add specialized entities is to create a stored procedure. For example, let's say we have a stored procedure called CreateBike that takes the parameters bikeName, dateOfPurchase, wheelType, and frameColor. The stored procedure knows we are creating a Bike, and can therefore create the Equipment record, insert the generic equipment data, create the Bike record, insert the specialized bike data, and maintain the foreign key relationship.
Using specialization will make your transactional life very simple. For example, if you want all Equipment purchased before 1/1/14, no joins are needed. If you want all Bikes with a frameColor of blue, no joins are needed. If you want all Lamps made of felt, no joins are needed. The only time you will need to join a specialized table back to the Equipment table is if you want data both from the parent entity and the specialized entity. For example, show all Lamps that use 100 Watt bulbs and are named "Super Lamp."
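The parent/child layout and the CreateBike idea can be sketched as follows, using Python's sqlite3 as a stand-in for SQL Server/MySQL (SQLite has no stored procedures, so an application function plays that role here; all names and sample data are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Equipment (
        equipmentID    INTEGER PRIMARY KEY AUTOINCREMENT,
        equipmentName  TEXT NOT NULL,
        dateOfPurchase TEXT
    );
    CREATE TABLE Bike (
        bikeID      INTEGER PRIMARY KEY AUTOINCREMENT,
        equipmentID INTEGER NOT NULL REFERENCES Equipment(equipmentID),
        wheelType   TEXT,
        frameColor  TEXT
    );
""")

def create_bike(name, purchased, wheel_type, frame_color):
    # Parent row first, then the specialised row linked by foreign key,
    # mirroring what the CreateBike stored procedure would do.
    cur = conn.execute(
        "INSERT INTO Equipment (equipmentName, dateOfPurchase) VALUES (?, ?)",
        (name, purchased))
    conn.execute(
        "INSERT INTO Bike (equipmentID, wheelType, frameColor) VALUES (?, ?, ?)",
        (cur.lastrowid, wheel_type, frame_color))
    return cur.lastrowid

eq_id = create_bike("BMX 200 Ultra", "2014-03-01", "road", "blue")

# Parent + child data needs exactly one join; child-only queries need none.
row = conn.execute("""
    SELECT e.equipmentName, b.frameColor
    FROM Equipment e JOIN Bike b ON b.equipmentID = e.equipmentID
    WHERE e.equipmentID = ?
""", (eq_id,)).fetchone()
print(row)  # ('BMX 200 Ultra', 'blue')
```

Queries against Equipment alone ("everything purchased before 1/1/14") or Bike alone ("blue frames") stay join-free, as described above.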
Hope this helps and best of luck!
Edit
Specialization and generalization, as mentioned in your provided source, are part of the Enhanced Entity-Relationship (EER) model, which helps define a conceptual data model for your schema. As such, it does not need to be "supported" per se; it is more of a design technique. Therefore, any database schema naturally supports specialization and generalization as long as the designer implements it.
As far as your Equipment_Category table goes, I see where you are coming from. It would indeed make it easy to have a dropdown of all categories. However, you could simply have a static table (only contains Strings that represent each category) to help with this population, and still keep your Equipment tables separate. You mentioned there will only be around 10-20 categories, so I see no reason to have a bridge between Equipment and Equipment_Category. The fewer joins the better. Another option would be to include an "equipmentCategory" field in the Equipment table instead of having a whole table for it. Then you could simply query for all unique equipmentCategory values.
I agree that you will want to keep your Equipment table to guaranteed common values between all children. Definitely. If things get too complicated and you need more defined entities, you could always break entities up again down the road. For example maybe half of your Bike entities are RoadBikes and the other half are MountainBikes. You could always continue the specialization break down to better get at those unique fields.
Stored Procedures are great for automating common queries. On top of that, parametrization provides an extra level of defense against security threats such as SQL injections.
I use SQL Server. The diagram I created is straight out of SQL Server Management Studio (SSMS). You can simply expand a database, right click on the Database Diagrams folder, and create a new diagram with your selected tables. SSMS does the rest for you. If you don't have access to SSMS I might suggest trying out Microsoft Visio or if you have access to it, Visual Paradigm.
I'm making my first site with Django, and I'm having a database design problem.
I need to store some of the users history, and I don't know whether it's better to create a table like this for each user every time one signs up:
table: $USERNAME$
id | some_data | some_more | even_more
or have one massive table from the start, with everyone's data in:
table: user_history
id | username | some_data | some_more | even_more
I know how to do the second one, just declare it in my Django models. If I should do the first one, how can I in Django?
The first organises the data more hierarchically but could potentially create a lot of tables, depending on the popularity of the service (is this a bad thing?)
The second seems to better suit Django's design philosophy (from what I've seen so far), and would make it easier to run comparative searches between users, but could get huge (number of users * average items in history). Can MySQL handle, say, 1 billion records? (I won't get that many, but it's good to plan ahead.)
Definitely the second format is the way you want to go. MySQL is pretty good at handling large numbers of rows (assuming they're indexed and cached as appropriate, of course). For example, all versions of all pages on Wikipedia are stored on one table in their database, and that works absolutely fine.
I don't know what Django is, but I'm sure it's not good practice to create a table per user for logging (or for almost anything, for that matter).
Best regards.
You should definitely store all users in one table, one row per user. It's the only way you can filter out data using a WHERE clause. And I'm not sure if MySQL can handle 1 billion records, but I've never found the records limit as a limiting factor. I wouldn't worry about the records limit for now.
You see, every high-load project started out as something that was simply well designed. A well-designed system has better prospects of being improved to handle huge loads.
Also keep in mind that even the genius guys at Twitter/FB/etc. did not know what issues they would experience after a while, and you will not know either. Solving load/scalability challenges, and predicting them, is a sort of rocket science.
So the best you can do now is start with the most normalized, textbook solution, and solve the bottlenecks as they appear.
When creating a relational database, you only want to create a new table if it contains significantly different data from the original table. In this case, all of the per-user tables would be pretty much the same, so you want just one table for all users.
If you want to break it down further, you may not want to store all the user's actions in the user table. You may want one table for user information and another for user history, i.e.:
table: User
Id | UserName | Password | other data
table: User_history
Id | user_id | some_data | timestamp
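A minimal sketch of that split, using Python's sqlite3 as a stand-in for MySQL (names and sample data are illustrative; a user_id foreign key links each history row back to its user):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE User (
        Id       INTEGER PRIMARY KEY AUTOINCREMENT,
        UserName TEXT UNIQUE NOT NULL,
        Password TEXT NOT NULL
    );
    CREATE TABLE User_history (
        Id        INTEGER PRIMARY KEY AUTOINCREMENT,
        user_id   INTEGER NOT NULL REFERENCES User(Id),
        some_data TEXT,
        timestamp TEXT DEFAULT CURRENT_TIMESTAMP
    );
""")
conn.execute("INSERT INTO User (UserName, Password) VALUES ('alice', 'hash')")
uid = conn.execute(
    "SELECT Id FROM User WHERE UserName = 'alice'").fetchone()[0]
conn.executemany(
    "INSERT INTO User_history (user_id, some_data) VALUES (?, ?)",
    [(uid, 'logged in'), (uid, 'changed settings')])

# Per-user history is a plain WHERE clause, not a per-user table.
history = conn.execute(
    "SELECT some_data FROM User_history WHERE user_id = ? ORDER BY Id",
    (uid,)).fetchall()
print([h[0] for h in history])  # ['logged in', 'changed settings']
```

With an index on user_id, this stays fast no matter how many users sign up.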
There's no need to be worried about the speed of your database as long as you define proper indexes on the fields you plan to search. Using those indexes will definitely speed up your response time as more records are put into your table. The database I work on has several tables with 30,000,000+ records and there's no slow-down.
Definitely DO NOT create a TABLE per user. Create a row per user, and possibly smaller satellite tables if some data can be factored out.
Definitely stick with one table for all users; consider that complicated queries may require extra resources when running across multiple tables instead of just one.
Run some tests; regarding resources, I am sure you will find that one table works best.
Everyone has pointed out that the second option is the way to go, I'll add my +1 to that.
About the first option: in Django, you create tables by declaring subclasses of django.db.models.Model, and when you run the management command syncdb it looks at all the models and creates the missing tables for all "managed" models. It might be possible to invoke this behavior at run time, but it isn't the way things are done.
I have a MySQL DB containing entry for pages of a website.
Let's say it has fields like:
Table pages:
id | title | content | date | author
Each of the pages can be voted by users, so I have two other tables
Table users:
id | name | etc etc etc
Table votes:
id | id_user | id_page | vote
Now, I have a page where I show a list of the pages (10-50 at a time) with various information along with the average vote of the page.
So, I was wondering if it were better to:
a) Run the query to display the pages (note that this is already fairly heavy, as it queries three tables) and then, for each entry, run another query to calculate the mean vote (or add a fourth join to the main query?).
or
b) Add an "average vote" column to the pages table, which I would update (along with the votes table) whenever a user votes on a page.
nico
Use the database for what it's meant for; option (a) is by far your best bet. It's worth noting that your query isn't actually particularly heavy; joining three tables is exactly the sort of thing SQL excels at.
Be cautious of this sort of premature optimization of SQL; SQL is far more efficient at what it does than most people think.
Note that another benefit of option (a) is that there's less code to maintain and less chance of data diverging as code gets updated; it's a lifecycle benefit, and these are too often ignored for minuscule optimization gains.
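Option (a) really is one query, not one query per page. A sketch using Python's sqlite3 as a stand-in for MySQL (table names follow the question; sample data is invented): the averages come back with the page listing via a LEFT JOIN and GROUP BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE votes (
        id      INTEGER PRIMARY KEY AUTOINCREMENT,
        id_user INTEGER,
        id_page INTEGER REFERENCES pages(id),
        vote    INTEGER
    );
    INSERT INTO pages VALUES (1, 'Home'), (2, 'About');
    INSERT INTO votes (id_user, id_page, vote) VALUES
        (1, 1, 5), (2, 1, 3), (3, 2, 4);
""")

# One query returns the page list with its average votes;
# LEFT JOIN keeps pages that have no votes yet (AVG is NULL for them).
rows = conn.execute("""
    SELECT p.id, p.title, AVG(v.vote) AS avg_vote
    FROM pages p
    LEFT JOIN votes v ON v.id_page = p.id
    GROUP BY p.id, p.title
    ORDER BY p.id
    LIMIT 50
""").fetchall()
print(rows)  # [(1, 'Home', 4.0), (2, 'About', 4.0)]
```

The database computes the aggregate where the data lives, and there is no denormalised "average vote" column to keep in sync.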
You might "repeat yourself" (violate DRY) for the sake of performance. The trade-offs are (a) extra storage, and (b) extra work in keeping everything self-consistent within your DB.
There are advantages/disadvantages both ways. Optimizing too early has its own set of pitfalls, though.
Honestly, for this issue, I would recommend the redundant information. Multiple votes for multiple pages can create a real load on a server, in my opinion. If you foresee real traffic on your website, of course... :-)