I have a hierarchical data structure which, as far as I can see, needs to have a series of successive many-to-many relationships.
It goes something like this:
Company
Account
Treaty
Benefit
Policy
Person
With the following relationships:
Company 1---∞ Account
Account 1---∞ Treaty
...all still fun
And then, many to many:
Treaty ∞---∞ Benefit, so I create the relational table TreatyBenefit, and do:
Treaty 1---∞ TreatyBenefit ∞---1 Benefit
Now, for a specific Treaty and a specific Benefit (i.e. a TreatyBenefit) there can be many Policies. But again, a single Policy can also fall under multiple TreatyBenefits.
So, then I have TreatyBenefit 1---∞ TreatyBenefitPolicy ∞---1 Policy
And then of course, the same applies to Person, so I also then get:
TreatyBenefitPolicy 1---∞ TreatyBenefitPolicyPerson ∞---1 Person
What I would like to know is whether there are any conventions for naming tables so that you can avoid names that become so long that they are essentially meaningless. Or are there better approaches to the design that avoid this kind of structure entirely?
Thanks
Karl
IMHO, unless there are other strong, widely accepted, meaningful business-centric names for these entities / concepts, I would stick with the trusted Many:Many mangles that you've described above.
Also, each of the 6 entity names you've listed is reasonably concise, so there seems little point in abbreviating them: Ben, Per, Pol, Acc, Co etc. would cause more confusion than benefit.
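To make the "mangled" names concrete, here is a minimal sketch using Python's sqlite3 module. The column names (treaty_id, benefit_id, policy_id), the sample rows and the policies_for helper are my own assumptions, not anything from the question. The point it shows is that TreatyBenefitPolicy references the TreatyBenefit pair with a composite foreign key, so the long name at least says exactly what the row depends on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
CREATE TABLE Treaty  (treaty_id  INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE Benefit (benefit_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE Policy  (policy_id  INTEGER PRIMARY KEY, name TEXT NOT NULL);

-- Treaty many-to-many Benefit
CREATE TABLE TreatyBenefit (
    treaty_id  INTEGER NOT NULL REFERENCES Treaty(treaty_id),
    benefit_id INTEGER NOT NULL REFERENCES Benefit(benefit_id),
    PRIMARY KEY (treaty_id, benefit_id)
);

-- TreatyBenefit many-to-many Policy
CREATE TABLE TreatyBenefitPolicy (
    treaty_id  INTEGER NOT NULL,
    benefit_id INTEGER NOT NULL,
    policy_id  INTEGER NOT NULL REFERENCES Policy(policy_id),
    PRIMARY KEY (treaty_id, benefit_id, policy_id),
    FOREIGN KEY (treaty_id, benefit_id)
        REFERENCES TreatyBenefit(treaty_id, benefit_id)
);
""")

# Hypothetical sample data.
conn.execute("INSERT INTO Treaty VALUES (1, 'Treaty A')")
conn.executemany("INSERT INTO Benefit VALUES (?, ?)",
                 [(1, 'Life'), (2, 'Disability')])
conn.executemany("INSERT INTO Policy VALUES (?, ?)",
                 [(100, 'P-100'), (200, 'P-200')])
conn.executemany("INSERT INTO TreatyBenefit VALUES (?, ?)", [(1, 1), (1, 2)])
conn.executemany("INSERT INTO TreatyBenefitPolicy VALUES (?, ?, ?)",
                 [(1, 1, 100), (1, 1, 200), (1, 2, 100)])


def policies_for(treaty_id, benefit_id):
    """Return the names of policies under one specific TreatyBenefit."""
    rows = conn.execute(
        """SELECT p.name
           FROM TreatyBenefitPolicy AS tbp
           JOIN Policy AS p ON p.policy_id = tbp.policy_id
           WHERE tbp.treaty_id = ? AND tbp.benefit_id = ?
           ORDER BY p.name""",
        (treaty_id, benefit_id))
    return [name for (name,) in rows]
```

TreatyBenefitPolicyPerson would follow the same pattern with one more key column.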
Related
I'm rewriting a system that is currently linked to a MySQL database that is roughly 1GB in size. There are hundreds of thousands of articles, each with a list of contributors (think Wiki style). I've not yet been given access to the existing database schema, but while I wait I've been brainstorming a bit.
Basically, what I'm wondering is whether an article_contributors table would be an efficient way of handling this, or whether there is a better way of approaching this situation. Considering there are roughly 200,000 articles, if there are 5 contributors on each, that'd be 1,000,000 rows in the meta table.
I'd call that a one-to-many table, not a "meta" table. Or else a multi-valued attribute.
Storing contributors in a separate table, one per row, is the proper way of designing a relational database. There may be other ways to store the data, but they are not relational.
Consider my answer to "Is storing a delimited list in a database column really that bad?" Storing the contributors as a list in the articles table causes a lot of common SQL queries to break or become horribly inefficient. If you need to do a variety of queries against this data, you will thank yourself for storing it in a normalized fashion.
On the other hand, if you never query anything but the list of contributors as an indivisible unit, then why not store it denormalized (as a list)? That's a valid choice too -- but it depends on how you're going to use the table.
By the way, 1 million rows is not a large MySQL database by some people's standards. This week I'm advising a client who has a table with 900 million rows.
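As a rough illustration of why the one-row-per-contributor layout keeps queries simple, here is a small sketch using Python's sqlite3 (the real system is MySQL, and the table layout, column names and sample data here are assumptions for the demo). Both "which articles did X contribute to" and "how many contributors per article" become plain joins and aggregates, with no string splitting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE articles (
    article_id INTEGER PRIMARY KEY,
    title      TEXT NOT NULL
);
-- One row per (article, contributor) pair.
CREATE TABLE article_contributors (
    article_id  INTEGER NOT NULL REFERENCES articles(article_id),
    contributor TEXT    NOT NULL,
    PRIMARY KEY (article_id, contributor)
);
""")
conn.executemany("INSERT INTO articles VALUES (?, ?)",
                 [(1, 'Joins'), (2, 'Indexes'), (3, 'Views')])
conn.executemany("INSERT INTO article_contributors VALUES (?, ?)",
                 [(1, 'ann'), (1, 'bob'), (2, 'ann'), (3, 'cy')])


def articles_by(contributor):
    """Titles of every article a given contributor worked on."""
    rows = conn.execute(
        """SELECT a.title
           FROM articles AS a
           JOIN article_contributors AS ac ON ac.article_id = a.article_id
           WHERE ac.contributor = ?
           ORDER BY a.title""",
        (contributor,))
    return [title for (title,) in rows]


def contributor_counts():
    """Map article_id -> number of contributors."""
    rows = conn.execute(
        """SELECT article_id, COUNT(*)
           FROM article_contributors
           GROUP BY article_id""").fetchall()
    return dict(rows)
```

With a delimited list in a single column, the first query would need a pattern match over every row and the second would need string parsing.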
An interesting question!
You're going to need to see the schema to get a straight answer about this. That's because the schema probably embodies some core decisions made by experts in bibliography (reference librarians, etc).
If you try to use a join table (articles_contributors) so you can avoid listing a given contributor multiple times when she contributes to multiple articles, you're implicitly declaring that you can create a canonical list of contributors, with a contributor_id for each distinct person.
In the world of bibliography and library science, that sort of list is called a "controlled vocabulary." It's controlled by an "authority." (Read this: http://en.wikipedia.org/wiki/Authority_control) That is, some organization has the responsibility to decide whether this "Jane Smaith" is a different person from that "Jane Smith." That is surprisingly hard to do correctly with people.
For an example of a relatively simple controlled vocabulary, see the "North American Industry Classification System" (NAICS). This has a code for each distinct kind of industry. http://www.census.gov/eos/www/naics/ It's controlled by national committees in three countries. Many bibliographic databases that cover industry include those terms as one of the ways of classifying their contents.
The designers of the system you're soon to take over will have made decisions about these kinds of controlled vocabularies. Will they have one for contributors? You could wait and see, or you could ask. But one thing is sure: the bibliographic designers won't be too delighted if you, on your own authority, create that kind of controlled vocabulary.
The Library of Congress in the USA doesn't attempt to create a controlled list of authors and contributors.
Edit
If you do have a definitive list of contributors, it is a good idea to create a join table articles_contributors as you suggested. You should consider the following columns:
article_id: part of the primary key
contributor_id: part of the primary key
role: part of the primary key; values like "author", "illustrator", "editor", etc.
order: 1, 2, 3 so contributors can be listed in proper order
contact: 1 or 0 indicating whether readers should contact this author for more info
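Here is one way that column list might look as a table definition, sketched with Python's sqlite3 rather than MySQL. The column types, the CHECK constraint, the sample rows and the credits helper are assumptions. Note that order is a reserved word in SQL, so it has to be quoted (or renamed).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE articles_contributors (
    article_id     INTEGER NOT NULL,
    contributor_id INTEGER NOT NULL,
    role           TEXT    NOT NULL,   -- 'author', 'illustrator', 'editor', ...
    "order"        INTEGER NOT NULL,   -- quoted because ORDER is reserved
    contact        INTEGER NOT NULL DEFAULT 0 CHECK (contact IN (0, 1)),
    PRIMARY KEY (article_id, contributor_id, role)
);
""")

# Hypothetical rows: contributor 10 is both an author and the illustrator.
conn.executemany(
    "INSERT INTO articles_contributors VALUES (?, ?, ?, ?, ?)",
    [(1, 10, 'author', 2, 0),
     (1, 11, 'author', 1, 1),
     (1, 10, 'illustrator', 3, 0)])


def credits(article_id):
    """Return (contributor_id, role) pairs in display order."""
    return conn.execute(
        """SELECT contributor_id, role
           FROM articles_contributors
           WHERE article_id = ?
           ORDER BY "order" """,
        (article_id,)).fetchall()
```

Including role in the primary key is what lets the same person appear twice on one article under different roles, while still rejecting an exact duplicate.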
I've found a few questions on modelling many-to-many relationships, but nothing that helps me solve my current problem.
Scenario
I'm modelling a domain that has Users and Challenges. Challenges have many users, and users belong to many challenges. Challenges exist, even if they don't have any users.
Simple enough. My question gets a bit more complicated as users can be ranked on the challenge. I can store this information on the challenge, as a set of users and their rank - again not too tough.
Question
What scheme should I use if I want to query the individual rank of a user on a challenge (without getting the ranks of all users on the challenge)? At this stage, I don't care how I make the call in data access, I just don't want to return hundreds of rank data points when I only need one.
I also want to know where to store the rank information; it feels like it's dependent upon both a user and a challenge. Here's what I've considered:
The obvious: when instantiating a Challenge, just get all the rank information; slower but works.
Make a composite UserChallenge entity, but that feels like it goes against the domain (we don't go around talking about "user-challenges").
Third option?
I want to go with number two, but I'm not confident enough to know if this is really the DDD approach.
Update
I suppose I could call UserChallenge something more domain appropriate like Rank, UserRank or something?
The DDD approach here would be to reason in terms of the domain and talk with your domain expert/business analyst/whoever about this particular point to refine the model. Don't forget that the names of your entities are part of the ubiquitous language and need to be understood and used by non-technical people, so maybe "UserChallenge" is not the most appropriate term here.
What I'd first do is try to determine if that "middle entity" deserves a place in the domain model and the ubiquitous language. For instance, if you're building a website and there's a dedicated Rankings page where the user can see a list of all his challenges with the associated ranks, chances are ranks are a key matter in the application and a Ranking entity will be a good choice to represent that. You can talk with your domain expert to see if Ranking is a good name for it, or go for another name.
On the other hand, if there's no evidence that such an entity is needed, I'd stick to option 1. If you're worried about performance issues, there are ways of reducing the multiplicity of the relationship. Eric Evans calls that qualifying the association (DDD, p.83-84). Technically speaking, it could mean that the Challenge has a map - or a dictionary of ranks with the User as a key.
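A minimal sketch of what a qualified association could look like in code, in Python. The class layout and the set_rank / rank_of method names are my own assumptions, and this only shows the in-memory shape; how the map gets loaded (eagerly, lazily, or one entry at a time) is a separate data access decision.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class User:
    """Frozen so instances are hashable and can be used as dict keys."""
    user_id: int
    name: str


@dataclass
class Challenge:
    title: str
    # The association Challenge -> ranks is "qualified" by User:
    # given a User, there is at most one rank, instead of a whole collection.
    _ranks: Dict[User, int] = field(default_factory=dict)

    def set_rank(self, user: User, rank: int) -> None:
        self._ranks[user] = rank

    def rank_of(self, user: User) -> Optional[int]:
        """Rank of one user, or None if the user has no rank here."""
        return self._ranks.get(user)


# Hypothetical usage.
alice = User(1, "alice")
bob = User(2, "bob")
challenge = Challenge("10k run")
challenge.set_rank(alice, 1)
challenge.set_rank(bob, 2)
```

Callers ask the Challenge for a single user's rank and never see the full set, which keeps the aggregate's interface small even if the storage behind it changes.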
I would go with Option 2. You don't have to "go around talkin about user-challenges", but you do have to go around grabbin all them Users for a given challenge and sorting them by rank and this model provides you a great way to do it!
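For what it's worth, here is roughly how Option 2 could look at the storage level, sketched with Python's sqlite3. The rankings table, its columns and the two helper functions are assumptions for illustration. One row per (challenge, user) pair means a single rank is a primary-key lookup, and the sorted list is a simple ORDER BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rankings (
    challenge_id INTEGER NOT NULL,
    user_id      INTEGER NOT NULL,
    user_rank    INTEGER NOT NULL,
    PRIMARY KEY (challenge_id, user_id)
);
""")
conn.executemany("INSERT INTO rankings VALUES (?, ?, ?)",
                 [(1, 10, 2), (1, 20, 1), (1, 30, 3), (2, 10, 1)])


def rank_of(challenge_id, user_id):
    """Fetch exactly one rank; returns None if the user is not ranked."""
    row = conn.execute(
        """SELECT user_rank FROM rankings
           WHERE challenge_id = ? AND user_id = ?""",
        (challenge_id, user_id)).fetchone()
    return row[0] if row else None


def leaderboard(challenge_id):
    """User ids for one challenge, best rank first."""
    rows = conn.execute(
        """SELECT user_id FROM rankings
           WHERE challenge_id = ?
           ORDER BY user_rank""",
        (challenge_id,))
    return [user_id for (user_id,) in rows]
```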
I'm setting up a database that will have 'business_owners' and 'customers'. I could set this up in a couple days but wanted to see what your opinion is on best practice.
I could have two tables, 'business_owners' and 'customers', each with name, email etc. or...
I could do one table 'Users' and have a user_type as 'business_owner' or 'customer' and just use that type to determine what to show.
I'm thinking the second option is best, any feedback?
Rule of thumb:
If you have more than one table with identical (or near identical) columns, they should be condensed into a single table. Use a type code or similar to distinguish between the kinds of row as necessary, and work out the business rules for columns that depend on the type code.
Answer:
The second option is the best approach. It's the most scalable, and will be the easiest to work with if you ever need to use resultsets that include both business owners & customers.
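A small sketch of the single-table option, using Python's sqlite3 so it runs anywhere. The column names, the CHECK constraint on user_type and the sample rows are assumptions; the asker's actual columns may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    user_id   INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    email     TEXT NOT NULL UNIQUE,
    -- The type code that distinguishes the two kinds of user.
    user_type TEXT NOT NULL
        CHECK (user_type IN ('business_owner', 'customer'))
);
""")
conn.executemany(
    "INSERT INTO users (name, email, user_type) VALUES (?, ?, ?)",
    [('Ann', 'ann@example.com', 'business_owner'),
     ('Bob', 'bob@example.com', 'customer'),
     ('Cy',  'cy@example.com',  'customer')])


def names_of_type(user_type):
    """Names of all users with the given type code."""
    rows = conn.execute(
        "SELECT name FROM users WHERE user_type = ? ORDER BY name",
        (user_type,))
    return [name for (name,) in rows]


def all_names():
    """Both kinds of user in one resultset, no UNION needed."""
    rows = conn.execute("SELECT name FROM users ORDER BY name")
    return [name for (name,) in rows]
```

The CHECK constraint keeps stray type codes out; a lookup table with a foreign key would be an alternative if the set of types is expected to grow.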
It depends on the difference between the two types. If they share exactly the same attributes aside from their role as either a 'customer' or a 'business owner', I would suggest going for the second option, to avoid the overkill of having identical columns in 2 separate tables.
How would you model this in an object model? Would you set up a single superclass, call it "stakeholders", that captures the properties of both business-owners and customers? Would you then set up specialized subclasses, "business-owner" and "customer" that extend the definition of stakeholders? If so, read on.
Your case looks like an instance of the Gen-Spec design pattern. Gen-spec is familiar to object oriented programmers through the superclass-subclass hierarchy. Unfortunately, introductions to relational database design tend to skip over how to design tables for the Gen-Spec situation. Fortunately, it’s well understood. A web search on “Relational database generalization specialization” will yield several articles on the subject. Some of your hits will be previous questions here on SO. Here is one article that discusses Gen-Spec in terms of Object Relational Mapping.
The trick is in the way the PK for the subclass (specialized) tables gets assigned. It’s not generated by some sort of autonumber feature. Instead, it’s a copy of the PK in the superclass (generalized) table, and is therefore an FK reference to it.
Thus, if the case were vehicles, trucks and sedans, every truck or sedan would have an entry in the vehicles table, trucks would also have an entry in the trucks table, with a PK that’s a copy of the corresponding PK in the vehicles table. Similarly for sedans and the sedan table. It’s easy to figure out whether a vehicle is a truck or a sedan by just doing joins, and you usually want to join the data in that kind of query anyway.
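The vehicles example above can be sketched like this with Python's sqlite3. The specific columns (make, payload_kg, seats), the sample rows and the vehicle_kinds helper are assumptions added for the demo. The part that matters is that the primary key of trucks and sedans is also a foreign key to vehicles.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
-- Generalized (superclass) table: one row per vehicle of any kind.
CREATE TABLE vehicles (
    vehicle_id INTEGER PRIMARY KEY,
    make       TEXT NOT NULL
);
-- Specialized tables: the PK is a copy of vehicles.vehicle_id, and is
-- always supplied explicitly rather than generated.
CREATE TABLE trucks (
    vehicle_id INTEGER PRIMARY KEY REFERENCES vehicles(vehicle_id),
    payload_kg INTEGER NOT NULL
);
CREATE TABLE sedans (
    vehicle_id INTEGER PRIMARY KEY REFERENCES vehicles(vehicle_id),
    seats      INTEGER NOT NULL
);
""")
conn.executemany("INSERT INTO vehicles VALUES (?, ?)",
                 [(1, 'Volvo'), (2, 'Toyota')])
conn.execute("INSERT INTO trucks VALUES (1, 12000)")
conn.execute("INSERT INTO sedans VALUES (2, 5)")


def vehicle_kinds():
    """Map vehicle_id -> 'truck' / 'sedan' by joining to the subtables."""
    rows = conn.execute(
        """SELECT v.vehicle_id,
                  CASE WHEN t.vehicle_id IS NOT NULL THEN 'truck'
                       WHEN s.vehicle_id IS NOT NULL THEN 'sedan'
                       ELSE 'unknown' END
           FROM vehicles AS v
           LEFT JOIN trucks AS t ON t.vehicle_id = v.vehicle_id
           LEFT JOIN sedans AS s ON s.vehicle_id = v.vehicle_id""").fetchall()
    return dict(rows)
```

For the business_owners / customers case, the same shape would apply with a shared table for the common columns and one subtable per kind, if the two kinds ever grow distinct attributes.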
I have a site where I want users to be able to identify their ethnicities. What's the best way to model this if there is only 1 level of hierarchy?
Solution 1 (single table):
Ethnicity
- Id
- Parent Id
- Name
Solution 2 (two tables):
Ethnicity Group
- Id
- Name
Ethnicity
- Id
- Ethnicity Group Id
- Name
I will be using this so that users can search for other users based on ethnicity. Which of the 2 approaches will work better for me? Is there another approach I have not considered? I'm using MySQL.
Well, there is such a thing as an Ethnicity Group in the real world, so you do need two tables, not one. The real world has three levels (the top-most would be Race), but I understand that may not be necessary here. If you squash the three levels into two, you have to be careful, and lay them all out properly at the beginning. However, they will be vulnerable to people saying they want the real thing, and you may have to change it, or change the structure to fit more in (much more work later).
If you do it correctly, as per real world, that problem is eliminated. Let me know if you want Race, and I will change the model.
The tables are far too small, and the keys are too meaningful, to add Id-iot columns to them; leave them as pure Relational keys, otherwise you will lose the power of the Relational engine. If you really want narrow keys, use a CHAR(2) EthnicityCode, rather than a NUMERIC(10,0) or a meaningless number.
Link to Ethnicity Data Model (plus the answer to your other question)
Link to IDEF1X Notation for those who are unfamiliar with the Relational Modelling Standard.
If there is nothing like an "ethnicity group" in the real world, I'd suggest you don't introduce one in your data model.
All the queries you can do with the second one you can also do with the first one, because you can just select FROM ethnicity AS e1 JOIN ethnicity AS e2 ON (e2.id = e1.parent_id).
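A runnable sketch of that self-join against Solution 1, using Python's sqlite3 instead of MySQL. The lower-case column names (id, parent_id, name) are my rendering of the question's Id / Parent Id / Name, and the rows are placeholder data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Solution 1: a single table; top-level rows have a NULL parent_id.
CREATE TABLE ethnicity (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES ethnicity(id),
    name      TEXT NOT NULL
);
""")
conn.executemany("INSERT INTO ethnicity VALUES (?, ?, ?)",
                 [(1, None, 'Group A'),
                  (2, 1, 'A1'),
                  (3, 1, 'A2'),
                  (4, None, 'Group B'),
                  (5, 4, 'B1')])


def with_parent():
    """Each child row paired with the name of its parent row."""
    return conn.execute(
        """SELECT e1.name, e2.name
           FROM ethnicity AS e1
           JOIN ethnicity AS e2 ON e2.id = e1.parent_id
           ORDER BY e1.name""").fetchall()
```

The inner join drops the top-level rows, since their parent_id is NULL, so the result has the same shape as joining the two tables of Solution 2.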
I don't want to be awkward, but what are you going to do with people of mixed descent? I think that the best that you can hope for is a simple single-level enumeration like the kind of thing you get on census forms (e.g. 'Black', 'White', 'Asian', 'Hispanic' etc). It's not ideal, but it allows people to fairly easily self-identify. Concepts like race and ethnicity are wooly enough without trying to create additional (largely meaningless) hierarchies on top of them, so my gut feeling is to keep it simple.
In designing an RDBMS schema, I wonder if there is a formal principle of concrete objects: for example, if it is a Persons table, then each record is very concrete and unique. Each record in fact represents a unique person.
But what about a table such as Courses (as in school). It can have a description, number of units, offered only in Autumn (Fall) or Spring, etc, which are the "general properties" of a course.
And then there is actual CourseSessions, which has information about the time_from and time_to (such as 10 to 11am), whether it is Monday, Wednesday or Tue / Thur, and the instructor teaching it, and also pointing back using a course_id to the Courses table.
So the above 2 tables are both needed.
Are there principles of table design for "concrete" vs "abstract"?
Update: what I mean by "abstract" here is that a course is an abstract idea... there can be multiple instances of it... such as the course Physics 10 from 10-11am, and another at 12-1pm.
for example, if it is a Persons table, then each record is very concrete and unique. Each record in fact represents a unique person.
That is the hope, but not the reality of the situation.
By immigration or legal death status, it is possible for there to be two (or more) records that represent the same person. Uniquely identifying people is difficult: first, middle and surnames can match but actually reflect different people. SSN/SIN are not reliable, because they can change (immigration, legally dead). A name doesn't guarantee gender, and gender can be changed.
Are there principles of table design for "concrete" vs "abstract"
The classification of being "concrete" vs "abstract" is arbitrary, subject to interpretation. Do the start and end dates really make a Course session "concrete"? I can book numerous things in [Calendaring software of choice]; that doesn't mean the class actually took place, or that final grades are legitimate values...
Table design is based on business rules, and the logical entities (which can become tables in the physical model) required to support those rules. Normalization helps make these entities more obvious.
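To illustrate with the question's own example, here is a sketch of the two entities as tables, using Python's sqlite3. The column names beyond course_id, time_from and time_to, and the sample rows, are assumptions. Nothing in the schema marks one table as "abstract" and the other as "concrete"; they are just two entities with a one-to-many relationship between them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- General properties of a course.
CREATE TABLE courses (
    course_id INTEGER PRIMARY KEY,
    title     TEXT    NOT NULL,
    units     INTEGER NOT NULL
);
-- One row per scheduled offering, pointing back to its course.
CREATE TABLE course_sessions (
    session_id INTEGER PRIMARY KEY,
    course_id  INTEGER NOT NULL REFERENCES courses(course_id),
    days       TEXT NOT NULL,      -- e.g. 'Mon/Wed' or 'Tue/Thu'
    time_from  TEXT NOT NULL,      -- 'HH:MM'
    time_to    TEXT NOT NULL,
    instructor TEXT NOT NULL
);
""")
conn.execute("INSERT INTO courses VALUES (1, 'Physics 10', 4)")
conn.executemany(
    "INSERT INTO course_sessions VALUES (?, ?, ?, ?, ?, ?)",
    [(1, 1, 'Mon/Wed', '10:00', '11:00', 'Instructor A'),
     (2, 1, 'Tue/Thu', '12:00', '13:00', 'Instructor B')])


def session_times(title):
    """(time_from, time_to) for every session of the named course."""
    return conn.execute(
        """SELECT s.time_from, s.time_to
           FROM course_sessions AS s
           JOIN courses AS c ON c.course_id = s.course_id
           WHERE c.title = ?
           ORDER BY s.time_from""",
        (title,)).fetchall()
```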
The relational data model, based on mathematics, provides a way to design your data model so that certain operations are guaranteed to be correct.
Unfortunately, this kind of data model does not by itself solve performance problems in a database. Organizing tables for a given business domain requires considering not only the abstract model of the objects and database normalization, but also performance planning for your system. Yes, the abstraction leaks.
For example, there are two design strategies for a tree structure: the adjacency model and the materialized path model (see The Art of SQL). Which one is better depends on which operations need to be optimized.
There is a good, classic article I recommend: The Law of Leaky Abstractions
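A small sketch of the two strategies side by side, using Python's sqlite3. The table names, the dotted path format and the sample tree are assumptions for the demo, not taken from The Art of SQL. Finding all descendants needs a recursive query in the adjacency model, but only a prefix match in the materialized path model; moving a subtree is the opposite trade-off, since every path under it would have to be rewritten.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Adjacency model: each row points at its parent.
CREATE TABLE nodes_adj (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES nodes_adj(id),
    name      TEXT NOT NULL
);
-- Materialized path model: each row stores its full path from the root.
CREATE TABLE nodes_path (
    path TEXT PRIMARY KEY,   -- e.g. '1.1.1'
    name TEXT NOT NULL
);
""")
# Same tree in both tables: root -> (a -> c), b
conn.executemany("INSERT INTO nodes_adj VALUES (?, ?, ?)",
                 [(1, None, 'root'), (2, 1, 'a'), (3, 1, 'b'), (4, 2, 'c')])
conn.executemany("INSERT INTO nodes_path VALUES (?, ?)",
                 [('1', 'root'), ('1.1', 'a'), ('1.2', 'b'), ('1.1.1', 'c')])


def descendants_adj(node_id):
    """All descendants via a recursive walk down parent_id links."""
    rows = conn.execute(
        """WITH RECURSIVE sub(id, name) AS (
               SELECT id, name FROM nodes_adj WHERE parent_id = ?
               UNION ALL
               SELECT n.id, n.name
               FROM nodes_adj AS n JOIN sub ON n.parent_id = sub.id
           )
           SELECT name FROM sub ORDER BY name""",
        (node_id,))
    return [name for (name,) in rows]


def descendants_path(path):
    """All descendants via a prefix match on the stored path."""
    rows = conn.execute(
        "SELECT name FROM nodes_path WHERE path LIKE ? ORDER BY name",
        (path + '.%',))
    return [name for (name,) in rows]
```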
Abstraction has its price (& it is often higher than expected)
By Keith Cooper
And The Art of SQL, of course, which is the soul of database design in my opinion.