In RDBMS, is there a formal design principle for Concrete objects, such as Course vs CourseSession? - mysql

In designing an RDBMS schema, I wonder if there is a formal principle for concrete objects: for example, if it is Persons table, then each record is very concrete and unique. Each record in fact represents a unique person.
But what about a table such as Courses (as in school). It can have a description, number of units, offered only in Autumn (Fall) or Spring, etc, which are the "general properties" of a course.
And then there are the actual CourseSessions, which have information about the time_from and time_to (such as 10 to 11am), whether it is Monday / Wednesday or Tue / Thur, and the instructor teaching it, and which also point back to the Courses table using a course_id.
So the above 2 tables are both needed.
Are there principles of table design for "concrete" vs "abstract"?
Update: what I mean by "abstract" here is that a course is an abstract idea... there can be multiple instances of it... such as the course Physics 10 from 10-11am, and another at 12-1pm.

for example, if it is Persons table, then each record is very concrete and unique. Each record in fact represents a unique person.
That is the hope, but not the reality of the situation.
Because of immigration or legal death status, it is possible for two (or more) records to represent the same person. Uniquely identifying people is difficult: first, middle and surnames can match but actually belong to different people. SSN/SIN are not reliable, because they can change (immigration, legally dead). A name doesn't guarantee gender, and gender can be changed.
Are there principles of table design for "concrete" vs "abstract"
The classification of being "concrete" vs "abstract" is arbitrary, subject to interpretation. Does the start and end date really make a Course session "concrete"? Because I can book numerous things in [Calendaring software of choice] - doesn't mean class actually took place, or that final grades are legitimate values...
Table design is based on business rules, and the logical entities (which can become tables in the physical model) required to support those rules. Normalization helps make these entities more obvious.
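As a minimal sketch of the two entities from the question, assuming SQLite through Python's sqlite3 module rather than MySQL: the table names and the course_id, time_from and time_to columns come from the question, while title, units, days and instructor are hypothetical columns added for illustration.

```python
import sqlite3

# Hypothetical schema: only course_id, time_from and time_to are named in the
# question; the remaining columns are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE courses (
    course_id INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    units     INTEGER NOT NULL
);
CREATE TABLE course_sessions (
    session_id INTEGER PRIMARY KEY,
    course_id  INTEGER NOT NULL REFERENCES courses(course_id),
    days       TEXT NOT NULL,
    time_from  TEXT NOT NULL,
    time_to    TEXT NOT NULL,
    instructor TEXT NOT NULL
);
""")

# One course, offered as two separate sessions.
conn.execute("INSERT INTO courses VALUES (1, 'Physics 10', 4)")
conn.executemany(
    "INSERT INTO course_sessions (course_id, days, time_from, time_to, instructor) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Mon/Wed", "10:00", "11:00", "Instructor A"),
        (1, "Tue/Thu", "12:00", "13:00", "Instructor B"),
    ],
)

# Each session row points back to exactly one course row.
sessions = conn.execute(
    "SELECT c.title, s.time_from, s.time_to "
    "FROM courses c JOIN course_sessions s ON s.course_id = c.course_id "
    "ORDER BY s.time_from"
).fetchall()
print(sessions)
```

Whether a session counts as "concrete" is still a business decision; the foreign key only records that every session belongs to one course.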

The relational data model, based on mathematics, provides a way to design your data model so that certain operations are provably correct.
Unfortunately, this kind of data model does not by itself solve performance problems in a database. How to organize tables for a given business domain depends not only on the abstract model of the objects or on database normalization, but also on performance planning for your system. Yes, this is a leaky abstraction.
For example, there are two design strategies for tree structures: the adjacency model and the materialized path model (see The Art of SQL). Which one is better depends on which operations need to be optimized.
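To make the trade-off concrete, here is a small sketch of both tree models, assuming SQLite through Python's sqlite3 module. The table names node_adj and node_path, the dotted path format, and the sample tree are all hypothetical choices made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Adjacency model: each row points at its parent.
CREATE TABLE node_adj (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES node_adj(id),
    name      TEXT NOT NULL
);
-- Materialized path model: each row stores its full path from the root.
CREATE TABLE node_path (
    path TEXT PRIMARY KEY,
    name TEXT NOT NULL
);
INSERT INTO node_adj VALUES
    (1, NULL, 'root'), (2, 1, 'a'), (3, 1, 'b'), (4, 2, 'a1'), (5, 4, 'a1x');
INSERT INTO node_path VALUES
    ('1', 'root'), ('1.1', 'a'), ('1.2', 'b'), ('1.1.1', 'a1'), ('1.1.1.1', 'a1x');
""")

# Adjacency model: finding all descendants of 'a' (id 2) needs a recursive walk.
adj_descendants = [row[0] for row in conn.execute("""
    WITH RECURSIVE sub(id, name) AS (
        SELECT id, name FROM node_adj WHERE parent_id = 2
        UNION ALL
        SELECT n.id, n.name FROM node_adj n JOIN sub ON n.parent_id = sub.id
    )
    SELECT name FROM sub ORDER BY name
""")]

# Materialized path model: the same question is a simple prefix match.
path_descendants = [row[0] for row in conn.execute(
    "SELECT name FROM node_path WHERE path LIKE '1.1.%' ORDER BY name"
)]

print(adj_descendants, path_descendants)
```

The costs are reversed for writes: moving a subtree in the adjacency model updates one parent_id, while the materialized path model has to rewrite the path of every row in the subtree.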
There is a good, classic article I recommend: The Law of Leaky Abstractions
Abstraction has its price (& it is often higher than expected)
By Keith Cooper
The Art of SQL is, of course, the soul of database design in my opinion.

Related

Database issue: 2 tables with identical structure because of the quality of the data

I have a database with one table where I store two different types of data.
I store a Quote and a Booking in a single table named Booking.
At first, I thought that a quote and a booking were the same thing, since they had the same fields.
But a quote is not related to a user, whereas a booking is.
We have a lot of quotes in our database, which pollute the Booking table with less important data.
I guess it makes sense to have two different tables so they can also evolve independently.
Quote
Booking
The objective is to split the data into junk data (quote) and the actual data (booking).
Does it make sense in relational database theory?
I'd start by looking for the domain model to tie this to - is a "quote" the same logical thing as a "booking"? Quotes typically have a different lifecycle to bookings, and bookings typically represent financial commitments. The fact they share some attributes is a hint that they are similar domain concepts, but it's not conclusive. Cars and goldfish share some attributes - age, location, colour - but it's hard to think of them as "similar concepts" at any fundamental level.
In database design, it's best to try to represent the business domain as far as is possible. It makes your code easy to understand, which makes it less likely you'll introduce bugs. It often makes the code simpler, too, which may make it faster.
If you decide they are related in the domain model, it may be a case of trying to model an inheritance hierarchy in the relational database. This question discusses this extensively.

Meta Tables in MySQL

I'm rewriting a system that is currently linked to a MySQL database that is roughly 1GB in size. There are hundreds of thousands of articles, each with a list of contributors (think Wiki style). I've not yet been given access to the existing database schema, but while I wait I've been brainstorming a bit.
Basically, what I'm wondering is if having an article_contributors table would be an efficient way of handling this or if there is a better method to approaching this situation. Considering there are roughly 200,000 articles, if there are 5 contributors on each, that'd be 1,000,000 rows in the meta table.
I'd call that a one-to-many table, not a "meta" table. Or else a multi-valued attribute.
Storing contributors in a separate table, one per row, is the proper way of designing a relational database. There may be other ways to store the data, but they are not relational.
Consider my answer to Is storing a delimited list in a database column really that bad? Storing the contributors as a list in the articles table causes a lot of common SQL queries to break or become horribly inefficient. If you need to do a variety of queries against this data, you will thank yourself for storing it in a normalized fashion.
On the other hand, if you never query anything but the list of contributors as an indivisible unit, then why not store it denormalized (as a list)? That's a valid choice too -- but it depends on how you're going to use the table.
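One way to see the difference is a small sketch, assuming SQLite through Python's sqlite3 module. The table names, column names and sample contributor names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Denormalized: contributors kept as a comma-separated list.
CREATE TABLE articles_list (
    article_id   INTEGER PRIMARY KEY,
    contributors TEXT NOT NULL
);
-- Normalized: one row per (article, contributor) pair.
CREATE TABLE article_contributors (
    article_id  INTEGER NOT NULL,
    contributor TEXT NOT NULL,
    PRIMARY KEY (article_id, contributor)
);
INSERT INTO articles_list VALUES (1, 'ann,bob'), (2, 'joann,carl');
INSERT INTO article_contributors VALUES
    (1, 'ann'), (1, 'bob'), (2, 'joann'), (2, 'carl');
""")

# A naive substring search on the list also matches 'joann'.
list_hits = [row[0] for row in conn.execute(
    "SELECT article_id FROM articles_list "
    "WHERE contributors LIKE '%ann%' ORDER BY article_id"
)]

# Equality on the normalized table matches only the intended contributor.
normalized_hits = [row[0] for row in conn.execute(
    "SELECT article_id FROM article_contributors "
    "WHERE contributor = 'ann' ORDER BY article_id"
)]

print(list_hits, normalized_hits)
```

The list query can be patched with delimiter tricks, but it still cannot use an ordinary index on the contributor, which is the inefficiency mentioned above.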
By the way, 1 million rows is not a large MySQL database by some people's standards. This week I'm advising a client who has a table with 900 million rows.
An interesting question!
You're going to need to see the schema to get a straight answer about this. That's because the schema probably embodies some core decisions made by experts in bibliography (reference librarians, etc).
If you try to use a join table (articles_contributors) so you can avoid listing a given contributor multiple times when she contributes to multiple articles, you're implicitly declaring that you can create a canonical list of contributors, with a contributor_id for each distinct person.
In the world of bibliography and library science, that sort of list is called a "controlled vocabulary." It's controlled by an "authority." (Read this: http://en.wikipedia.org/wiki/Authority_control) That is, some organization has the responsibility to decide whether this "Jane Smaith" is a different person from that "Jane Smith." That is surprisingly hard to do correctly with people.
For an example of a relatively simple controlled vocabulary, see the "North American Industry Classification System" (NAICS). This has a code for each distinct kind of industry. http://www.census.gov/eos/www/naics/ It's controlled by national committees in three countries. Many bibliographic databases that cover industry include those terms as one of the ways of classifying their contents.
The designers of the system you're soon to take over will have made decisions about these kinds of controlled vocabularies. Will they have one for contributors? You could wait and see, or you could ask. But one thing is sure: the bibliographic designers won't be too delighted if you, on your own authority, create that kind of controlled vocabulary.
The Library of Congress in the USA doesn't attempt to create a controlled list of authors and contributors.
Edit
If you do have a definitive list of contributors, it is a good idea to create a join table articles_contributors as you suggested. You should consider the following columns:
article_id: part of the primary key
contributor_id: part of the primary key
role: part of the primary key, with values like "author", "illustrator", "editor", etc.
order: 1, 2, 3, so contributors can be listed in proper order
contact: 1 or 0, indicating whether readers should contact this author for more info
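The columns above could be sketched as follows, assuming SQLite through Python's sqlite3 module. The articles and contributors parent tables and the sample rows are hypothetical, and the order column is named sort_order here because ORDER is a reserved word in SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Hypothetical parent tables, reduced to the minimum needed for the example.
CREATE TABLE articles (article_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE contributors (contributor_id INTEGER PRIMARY KEY, name TEXT NOT NULL);

CREATE TABLE articles_contributors (
    article_id     INTEGER NOT NULL REFERENCES articles(article_id),
    contributor_id INTEGER NOT NULL REFERENCES contributors(contributor_id),
    role           TEXT NOT NULL,
    sort_order     INTEGER NOT NULL,  -- "order" in the list above
    contact        INTEGER NOT NULL DEFAULT 0 CHECK (contact IN (0, 1)),
    PRIMARY KEY (article_id, contributor_id, role)
);

INSERT INTO articles VALUES (1, 'Example article');
INSERT INTO contributors VALUES (1, 'Jane Smith'), (2, 'Jane Smaith');
INSERT INTO articles_contributors VALUES
    (1, 2, 'illustrator', 2, 0),
    (1, 1, 'author', 1, 1),
    (1, 1, 'editor', 3, 0);
""")

# List the credits of article 1 in their intended display order.
credits = conn.execute("""
    SELECT c.name, ac.role
    FROM articles_contributors ac
    JOIN contributors c ON c.contributor_id = ac.contributor_id
    WHERE ac.article_id = 1
    ORDER BY ac.sort_order
""").fetchall()
print(credits)
```

Including role in the primary key lets the same person appear on one article as both author and editor, while still rejecting an exact duplicate credit.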

Steps to design a well organized and normalized Relational Database

I just started making a database for my website, so I am re-reading Database Systems - Design, Implementation and Management (9th Edition), but I notice there is no single step-by-step process described in the book for creating a well-organized and normalized database. The book seems to be a little all over the place, and although the normalization process is all in one place, the steps leading up to it are not.
I thought it would be very useful to have all the steps in one list, but I cannot find anything like that online or anywhere else. I realize an answer explaining all of the steps would be quite extensive, but anything I can get on this subject will be greatly appreciated, including the order of steps before normalization and links with suggestions.
Although I am semi-familiar with the process, I took a long break (about 1 year) from designing any databases, so I would like everything described in detail.
I am especially interested in:
What's a good approach to begin modeling a database (or how to list business rules so it's not confusing)
I would like to use ER or EER (extended entity relationship model) and I would like to know
how to model subtypes and supertypes correctly using EER (disjoint and overlapping), as well as how to write down the business rules for them so you know that it's a subtype, if there is any common way of doing that
(I am already familiar with the normalization process, but an answer can include tips about it as well)
Still need help with:
Writing down business rules (including business rules for subtypes and super types in EER)
How to use subtypes and super-types in EER correctly (how to model them)
Any other suggestions will be appreciated.
I would recommend these videos (about 9) on E/R modeling:
http://www.youtube.com/watch?v=q1GaaGHHAqM
EDIT:
"how extensive must the diagrams for this model be ? must they include all the entities and attributes?? "
Yes. There is ER modeling and there is extended ER (EER) modeling.
The idea is to do the extended ER modeling, because there you not only specify the entities, you also specify the PKs, the FKs and the cardinality. Take a look at this link (see the graphics and the difference between both models).
There are two ways of modeling: one reflects the real scenario and the other reflects the actual structure of the DB, i.e.:
When you create an EER model you draw the relationship and cardinality for ALL entities, but when you create the DB it is not necessary to create a table for relationships with 1:N cardinality (the table on the N side gets an FK to the table on the 1 side, so you don't need a separate relationship table in the DB). When you have a 1:1 cardinality, one of your entities can absorb the other entity.
Look at this graphic: only the N:M relationship entities were created as tables (when you see 2 or more FKs, that's a relationship table).
But remember those are just "rules", and you can break them if your design needs to, for performance, security, etc.
As for tools, there are a lot of them, but I recommend Workbench, because you can use it to connect to your DBs (if you are on MySQL) and create E/R designs with attributes, and it will auto-create the N:M relationship tables.
EDIT 2:
Here are some links that can explain that a little better; it would take a lot of lines and be harder to explain here by myself. Please review these links and let me know if you have questions:
type and subtype:
http://www.siue.edu/~dbock/cmis450/4-eermodel.htm
business rules (integrity constraints):
http://www.deeptraining.com/litwin/dbdesign/FundamentalsOfRelationalDatabaseDesign.aspx (please take a special look at this one; I think it will help you with all this info)
http://cs-people.bu.edu/rkothuri/lect12-constraints.ppt
I have reread the book and some articles online and have created a short list of steps for designing a decent database (of course, you need to understand the basics of database design first). The steps are described in greater detail below:
(A lot of the steps are described in the book Database Systems - Design, Implementation and Management (9th Edition), and that's what the page numbers refer to, but I will try to describe as much as I can here and will edit this answer in the following days to make it more complete.)
1. Create a detailed narrative of the organization’s description of operations.
2. Identify the business rules based on the description of operations.
3. Identify the main entities and relationships from the business rules.
4. Translate entities/relationships to an EER model
5. Check naming conventions
6. Map the EER model to a logical model (pg 400)*
7. Normalize the logical model (pg 179)
8. Improve the DB design (pg 187)
9. Validate Logical Model Integrity Constraints (pg 402) (like length etc.)
10. Validate the Logical Model against User Requirements
11. Translate tables to MySQL code (in Workbench, translate the EER model to an SQL file using the export function, then load it into MySQL)
*You can possibly skip this step if you are using Workbench and working off the ER model that you design there.
1. Describe the workings of the company in great detail. If you are creating a personal project, describe it in detail; if you are working with a company, ask for documents describing the company and interview the employees for information (interviews might generate inconsistent information, so make sure to check with supervisors which information is more important for the design).
2. Look at the gathered information and start generating rules from it; make sure to fill in any gaps in your knowledge. Confirm with supervisors in the company before moving on.
3. Identify the main entities and relationships from the business rules. Keep in mind that during the design process, the database designer does not depend simply on interviews to help define entities, attributes, and relationships. A surprising amount of information can be gathered by examining the business forms and reports that an organization uses in its daily operations. (pg 123)
4. If the database is complex, you can break down the ERD design into the following substeps:
i) Create External Models (pg 46)
ii) Combine External Models to form Conceptual Model (pg 48)
Apply the following steps recursively to the design (or to each substep):
I. Develop the initial ERD.
II. Identify the attributes and primary keys that adequately describe the entities.
III. Revise and review the ERD.
IV. Repeat steps until satisfactory output
You may also use entity clustering to further simplify your design process.
Describing database through ERD:
Use solid lines to connect Weak Entities (weak entities are those which cannot exist without a parent entity and which contain the parent's PK in their PK).
Use dashed lines to connect Strong Entities (strong entities are those which can exist independently of any other entity).
5. Check if your names follow your naming conventions. I used to have suggestions for naming conventions here but people didn't really like them. I suggest following your own standards or looking up some naming conventions online. Please post a comment if you found some naming conventions that are very useful.
6. Logical design generally involves translating the ER model into a set of relations (tables), columns, and constraint definitions.
Translate the ER model to a logical model using these steps:
Map strong entities (entities that don't need other entities to exist)
Map supertype/subtype relationships
Map weak entities
Map binary relationships
Map higher degree relationships
7. Normalize the logical model. You may also denormalize the logical model in order to gain some desired characteristics (like improved performance).
8. Refine Attribute Atomicity - It is generally good practice to pay attention to the atomicity requirement. An atomic attribute is one that cannot be further subdivided. Such an attribute is said to display atomicity. By improving the degree of atomicity, you also gain querying flexibility.
Refine Primary Keys as Required for Data Granularity - Granularity refers to the level of detail represented by the values stored in a table’s row. Data stored at their lowest level of granularity are said to be atomic data, as explained earlier. For example, imagine an ASSIGN_HOURS attribute that represents the hours worked by a given employee on a given project. However, are those values recorded at their lowest level of granularity? In other words, does ASSIGN_HOURS represent the hourly total, daily total, weekly total, monthly total, or yearly total? Clearly, ASSIGN_HOURS requires more careful definition. In this case, the relevant question would be as follows: For what time frame (hour, day, week, month, and so on) do you want to record the ASSIGN_HOURS data?
For example, assume that the combination of EMP_NUM and PROJ_NUM is an acceptable (composite) primary key in the ASSIGNMENT table. That primary key is useful in representing only the total number of hours an employee worked on a project since its start. Using a surrogate primary key such as ASSIGN_NUM provides lower granularity and yields greater flexibility. For example, assume that the EMP_NUM and PROJ_NUM combination is used as the primary key, and then an employee makes two “hours worked” entries in the ASSIGNMENT table. That action violates the entity integrity requirement. Even if you add the ASSIGN_DATE as part of a composite PK, an entity integrity violation is still generated if any employee makes two or more entries for the same project on the same day. (The employee might have worked on the project a few hours in the morning and then worked on it again later in the day.) The same data entry yields no problems when ASSIGN_NUM is used as the primary key.
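The ASSIGNMENT example can be tried out with a short sketch, assuming SQLite through Python's sqlite3 module. The two table names and the sample employee, project, date and hour values are hypothetical; the column names follow the attributes named above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Variant 1: composite primary key including the date.
CREATE TABLE assignment_composite (
    emp_num      INTEGER NOT NULL,
    proj_num     INTEGER NOT NULL,
    assign_date  TEXT NOT NULL,
    assign_hours REAL NOT NULL,
    PRIMARY KEY (emp_num, proj_num, assign_date)
);
-- Variant 2: surrogate primary key ASSIGN_NUM.
CREATE TABLE assignment_surrogate (
    assign_num   INTEGER PRIMARY KEY,
    emp_num      INTEGER NOT NULL,
    proj_num     INTEGER NOT NULL,
    assign_date  TEXT NOT NULL,
    assign_hours REAL NOT NULL
);
""")

# Same employee, same project, same day: a morning and an afternoon entry.
morning = (101, 7, "2024-03-01", 2.5)
afternoon = (101, 7, "2024-03-01", 3.0)

conn.execute("INSERT INTO assignment_composite VALUES (?, ?, ?, ?)", morning)
try:
    conn.execute("INSERT INTO assignment_composite VALUES (?, ?, ?, ?)", afternoon)
    composite_accepts_second_entry = True
except sqlite3.IntegrityError:
    # The second entry repeats the composite key, so it is rejected.
    composite_accepts_second_entry = False

for entry in (morning, afternoon):
    conn.execute(
        "INSERT INTO assignment_surrogate "
        "(emp_num, proj_num, assign_date, assign_hours) VALUES (?, ?, ?, ?)",
        entry,
    )
surrogate_rows = conn.execute(
    "SELECT COUNT(*) FROM assignment_surrogate"
).fetchone()[0]

print(composite_accepts_second_entry, surrogate_rows)
```

With the surrogate key, both entries are stored and daily or project totals can still be derived with SUM(assign_hours).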
Try to answer questions such as: "Who will be allowed to use the tables, and what portion(s) of the table(s) will be available to which users?"
Please feel free to leave suggestions or links to better descriptions in the comments below, and I will add them to my answer.
One aspect of your question touched on representing subclass-superclass relationships in SQL tables. Martin Fowler discusses three ways to design this, of which my favorite is Class Table Inheritance. The tricky part is arranging for the Id field to propagate from superclasses to subclasses. Once you get that done, the joins you will typically want to do are slick, easy, and fast.
There are six main steps in designing any database:
1. Requirements Analysis
2. Conceptual Design
3. Logical Design
4. Schema Refinement
5. Physical Design
6. Application & Security Design.

Setting up database for 'business_owners' and 'customers'

I'm setting up a database that will have 'business_owners' and 'customers'. I could set this up in a couple days but wanted to see what your opinion is on best practice.
I could have two tables, 'business_owners' and 'customers', each with name, email etc. or...
I could do one table 'Users' and have a user_type as 'business_owner' or 'customer' and just use that type to determine what to show.
I'm thinking the second option is best, any feedback?
Rule of thumb:
If you have more than one table with identical (or near identical) columns, they should be condensed into a single table. Use a type code/etc to distinguish between as necessary, and work out the business rules for columns that depend on the type code.
Answer:
The second option is the best approach. It's the most scalable, and will be the easiest to work with if you ever need to use resultsets that include both business owners & customers.
It depends on the difference between the two types. If they share exactly the same attributes aside from their role as either a 'customer' or a 'business owner', I would suggest going for the second option, to avoid the overkill of having identical columns in 2 separate tables.
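A minimal sketch of the second option, assuming SQLite through Python's sqlite3 module. The users table and user_type column come from the question; the other columns, the CHECK constraint and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    user_id   INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    email     TEXT NOT NULL UNIQUE,
    -- The type code distinguishes the two kinds of user.
    user_type TEXT NOT NULL
        CHECK (user_type IN ('business_owner', 'customer'))
);
INSERT INTO users (name, email, user_type) VALUES
    ('Alice', 'alice@example.com', 'business_owner'),
    ('Bob',   'bob@example.com',   'customer'),
    ('Cara',  'cara@example.com',  'customer');
""")

# A single query covers both kinds of user.
counts = dict(conn.execute(
    "SELECT user_type, COUNT(*) FROM users GROUP BY user_type"
))
print(counts)
```

If the two types later grow columns that apply to only one of them, those columns either become nullable here or move into separate per-type tables.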
How would you model this in an object model? Would you set up a single superclass, call it "stakeholders", that captures the properties of both business-owners and customers? Would you then set up specialized subclasses, "business-owner" and "customer" that extend the definition of stakeholders? If so, read on.
Your case looks like an instance of the Gen-Spec design pattern. Gen-spec is familiar to object oriented programmers through the superclass-subclass hierarchy. Unfortunately, introductions to relational database design tend to skip over how to design tables for the Gen-Spec situation. Fortunately, it’s well understood. A web search on “Relational database generalization specialization” will yield several articles on the subject. Some of your hits will be previous questions here on SO. Here is one article that discusses Gen-Spec in terms of Object Relational Mapping.
The trick is in the way the PK for the subclass (specialized) tables gets assigned. It’s not generated by some sort of autonumber feature. Instead, it’s a copy of the PK in the superclass (generalized) table, and is therefore an FK reference to it.
Thus, if the case were vehicles, trucks and sedans, every truck or sedan would have an entry in the vehicles table, trucks would also have an entry in the trucks table, with a PK that’s a copy of the corresponding PK in the vehicles table. Similarly for sedans and the sedan table. It’s easy to figure out whether a vehicle is a truck or a sedan by just doing joins, and you usually want to join the data in that kind of query anyway.
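The vehicles example could be sketched like this, assuming SQLite through Python's sqlite3 module. The columns make, payload_kg and seats, the helper functions add_truck and add_sedan, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE vehicles (
    vehicle_id INTEGER PRIMARY KEY,
    make       TEXT NOT NULL
);
-- The subclass PK is a copy of the superclass PK, and an FK to it.
CREATE TABLE trucks (
    vehicle_id INTEGER PRIMARY KEY REFERENCES vehicles(vehicle_id),
    payload_kg INTEGER NOT NULL
);
CREATE TABLE sedans (
    vehicle_id INTEGER PRIMARY KEY REFERENCES vehicles(vehicle_id),
    seats      INTEGER NOT NULL
);
""")


def add_truck(make, payload_kg):
    """Insert the general row first, then reuse its key for the truck row."""
    cur = conn.execute("INSERT INTO vehicles (make) VALUES (?)", (make,))
    conn.execute("INSERT INTO trucks VALUES (?, ?)", (cur.lastrowid, payload_kg))
    return cur.lastrowid


def add_sedan(make, seats):
    """Insert the general row first, then reuse its key for the sedan row."""
    cur = conn.execute("INSERT INTO vehicles (make) VALUES (?)", (make,))
    conn.execute("INSERT INTO sedans VALUES (?, ?)", (cur.lastrowid, seats))
    return cur.lastrowid


truck_id = add_truck("Make A", 1200)
sedan_id = add_sedan("Make B", 5)

# Joins against the subclass tables reveal what kind each vehicle is.
kinds = conn.execute("""
    SELECT v.make,
           CASE WHEN t.vehicle_id IS NOT NULL THEN 'truck'
                WHEN s.vehicle_id IS NOT NULL THEN 'sedan' END AS kind
    FROM vehicles v
    LEFT JOIN trucks t ON t.vehicle_id = v.vehicle_id
    LEFT JOIN sedans s ON s.vehicle_id = v.vehicle_id
    ORDER BY v.vehicle_id
""").fetchall()
print(kinds)
```

Nothing in this sketch stops one vehicle from having both a trucks row and a sedans row; enforcing that the subtypes are disjoint would need an extra constraint, such as a type column in vehicles.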

Naming relational tables without getting ridiculous

I have a hierarchical data structure which, as far as I can see, needs to have a series of successive many-to-many relationships.
It goes something like this:
Company
Account
Treaty
Benefit
Policy
Person
With the following relationships:
Company 1---∞ Account
Account 1---∞ Treaty
...all still fun
And then, many to many:
Treaty ∞---∞ Benefit, so I create the relational table TreatyBenefit, and do:
Treaty 1---∞ TreatyBenefit ∞---1 Benefit
Now, for a specific Treaty and a specific Benefit (i.e. a TreatyBenefit) there can be many Policies. But again, a single policy can also fall under multiple TreatyBenefits.
So, then I have TreatyBenefit 1---∞ TreatyBenefitPolicy ∞---1 Policy
And then of course, the same applies to Person, so I also then get:
TreatyBenefitPolicy 1---∞ TreatyBenefitPolicyPerson ∞---1 Person
What I would like to know is if there are any conventions for naming tables so that you can avoid names that become so long that they are essentially meaningless? Or are there better approaches to the design that avoids this kind of structure entirely?
Thanks
Karl
IMHO, unless there are other strong, widely accepted, meaningful business-centric names for these entities / concepts, I would stick with the trusted many-to-many name mangling that you've described above.
Also, each of the 6 entity names you've listed is reasonably concise, so there seems little point in abbreviating them; e.g. Ben, Per, Pol, Acc, Co etc. would cause more confusion than benefit.