Building Organizations with Subgroups - mysql

I'm struggling with a database design issue, and it's kind of a long winded one:
My website will have an unlimited number of organizations that users can join, subgroups under those organizations, and finally specific profiles for those subgroups. Subgroups within the same organization will be able to borrow and make changes to profiles from each other. Users will generate the organizations, the subgroups, and the profiles.
I can draw it out and make the flow sensible on paper. When it comes to actually putting it into SQL, I'm lost. The majority of the help guides out there assume static groups, so a simple primary and foreign key setup can refer back to the right information. Mine has too much dynamic information for most of these to work outright, as I understand it.
Most writers say to stay away from dynamically generated tables, but that's where my instinct takes me. Another idea I had was three massive tables: one each for all Organizations, Groups, and Profiles.
So is there a better way to go about this? Or are there any good documents I should read up on to help me translate from drawing to actual code?
I have some experience with both SQL and MongoDB if that helps explain things.

I don't know about MongoDB (NoSQL), but from the SQL standpoint, here is my opinion.
As far as your schema goes: most of the time when your "instinct" tells you that a "dynamic tables" solution is your best bet for the problem you are working on, remember there is a high chance that the very same problem can be solved by multiple static tables with the right relationships between them. (By static I mean tables which you have created yourself as a developer.)
I'd also mention that in my early days I approached problem solving the same way, but then I started understanding the principles and how databases actually work.
Back to your problem:
If your organization hierarchy consists of three major types of objects/levels, viz. Organizations, Groups, and Profiles, then I'd suggest you go with three tables and the correct relationships between them, which any SQL engine is quite efficient at handling compared to creating tables at runtime.
Now if the hierarchy is dynamic, say an organization can contain many groups, which in turn contain profiles, which again can contain other organizations and so on, then you may want to look at recursive structures in SQL. (Just do a Google search; there are a lot of articles about that.)
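To make that concrete, here is a minimal sketch of the three-static-tables approach, plus an optional link table for the "borrowing" you described. All table and column names are illustrative assumptions, not taken from your design:

CREATE TABLE organizations (
    organization_id INT AUTO_INCREMENT PRIMARY KEY,
    name            VARCHAR(100) NOT NULL
);

CREATE TABLE subgroups (
    subgroup_id     INT AUTO_INCREMENT PRIMARY KEY,
    organization_id INT NOT NULL,
    name            VARCHAR(100) NOT NULL,
    FOREIGN KEY (organization_id) REFERENCES organizations (organization_id)
);

CREATE TABLE profiles (
    profile_id  INT AUTO_INCREMENT PRIMARY KEY,
    subgroup_id INT NOT NULL,   -- the subgroup that owns the profile
    name        VARCHAR(100) NOT NULL,
    FOREIGN KEY (subgroup_id) REFERENCES subgroups (subgroup_id)
);

-- Optional link table so sibling subgroups in the same organization can
-- "borrow" each other's profiles without any tables being created at runtime.
CREATE TABLE subgroup_borrowed_profiles (
    subgroup_id INT NOT NULL,
    profile_id  INT NOT NULL,
    PRIMARY KEY (subgroup_id, profile_id),
    FOREIGN KEY (subgroup_id) REFERENCES subgroups (subgroup_id),
    FOREIGN KEY (profile_id)  REFERENCES profiles (profile_id)
);

New user-generated organizations, subgroups, and profiles simply become new rows in these tables; no new tables are ever created.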

Related

Best Practice for a database that continues to expand with more tables

I am a Ruby on Rails application developer and have helped with the creation of a database "college_a", containing information about the college I work for.
Initially, "college_a" contained tables for apps, apps_auths, and apps_roles. Over the past years it has expanded to contain tables for employees, job info, and supervisors ... all in the "college_a" database. I am looking at saving data about email accounts and am wondering if there are any drawbacks to continuing to add more tables to "college_a" or should I start looking at creating a new database representing data on email accounts which will need to reference data from the "college_a.people" table.
I don't have experience with maintaining or using an ever-expanding database, or with needing to query multiple databases to gather data, which is why I am asking whether there are any negatives to continuing to add more tables to the "college_a" database.
Thank you.
Well, there is no problem with continuing to add tables, as long as they are well designed and justified. In other words, I don't think there is such a thing as too many tables, but you must make an effort to keep the design clean.
This means there shouldn't be any redundancy, you should try to keep the database normalized, you should try to keep documentation about it, etc.
I found this article that enumerates some database mistakes that I think you may find helpful.
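As a minimal sketch of what "just another well-designed table" could look like in this case (the column names, and the assumption that college_a.people has an integer id primary key, are mine, not from the question):

CREATE TABLE email_accounts (
    email_account_id INT AUTO_INCREMENT PRIMARY KEY,
    person_id        INT NOT NULL,
    address          VARCHAR(255) NOT NULL UNIQUE,
    created_at       DATE NOT NULL,
    -- assumes college_a.people has an integer primary key named id
    FOREIGN KEY (person_id) REFERENCES people (id)
);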

MySQL Relational Database with Large Data Sets Unique to Each User

I am working on a project which involves building a social network-style application allowing users to share inventory/product information within their network (for sourcing).
I am a decent programmer, but I am admittedly not an expert with databases; even more so when it comes to database design. Currently, user/company information is stored via a relational database scheme in MySQL which is working perfectly.
My problem is that while my relational scheme works brilliantly for user/company information, it is confusing me on how to implement inventory information. The issue is that each "inventory list" will contain attributes specific to its product type, so the attributes differ from list to list but are identical for every product within the same list. My first thought was to create a table for each "inventory list". However, I feel like this would be very messy and would complicate future attempts at KDD. I also (briefly) considered using a 'master inventory' and storing the variable categories and data as a JSON string, but I figured JSON strings in MySQL would just become a larger pain in the ass.
My question is essentially how would someone else solve this problem? Or, more generally, sticking with principles of relational database management, what is the "correct" way to associate unique, large data sets of similar type with a parent user? The thing is, I know I could easily jerry-build something that would work, but I am genuinely interested in what the consensus is on how to solve this problem.
Thanks!
I would check out this post: Entity Attribute Value Database vs. strict Relational Model Ecommerce
The way I've always seen this done is to make a base table for inventory that stores universally common fields. A product id, a product name, etc.
Then you have another table that has dynamic attributes. A very popular example of this is Wordpress. If you look at their data model, they use this idea heavily.
One of the good things about this approach is that it's flexible. One of the major negatives is that it's slow and can produce complex code.
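A rough sketch of that base-table-plus-attributes idea (WordPress does something similar with its posts and post-meta tables); all names below are illustrative assumptions:

-- Base table: the fields every product shares.
CREATE TABLE products (
    product_id INT AUTO_INCREMENT PRIMARY KEY,
    name       VARCHAR(150) NOT NULL
);

-- Attribute table: one row per product per dynamic attribute.
CREATE TABLE product_attributes (
    product_id      INT NOT NULL,
    attribute_name  VARCHAR(100) NOT NULL,
    attribute_value VARCHAR(255),
    PRIMARY KEY (product_id, attribute_name),
    FOREIGN KEY (product_id) REFERENCES products (product_id)
);

-- Reading a product back means joining (or pivoting) per attribute,
-- which is where the "slow and complex" caveat above comes from.
SELECT p.name, a.attribute_name, a.attribute_value
FROM products AS p
JOIN product_attributes AS a ON a.product_id = p.product_id
WHERE p.product_id = 42;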
I'll throw out an alternative of using a document database. In that case, each document can have a different schema/structure and you can still run queries against them.

'Many to two' relationship

I am wondering about a 'many to two' relationship. The child can be linked to either of two parents, but not both. Is there any way to enforce this? I would also like to prevent duplicate entries in the child.
A real world example would be phone numbers, users and companies. A company can have many phone numbers, a user can have many phone numbers, but ideally the user shouldn't provide the same phone number as the company as there would be duplicate content in the DB.
This question shows that you don't fully understand entity relationships (no rudeness intended), of which there are four (technically only three) types, listed below:
One to One
One to Many
Many to One
Many to Many
One to One (1:1):
In this case a table has been broken up into two parts, either for purposes of complying with normalisation or, more usually, the open/closed principle.
Normalisation compliance: You might have a business rule that each customer has only one account. Technically, you could in this case say customer and account could all be in the same table, but this breaks the rules of normalisation, so you split them and make a 1:1.
Open/closed principle compliance: A customer table might have id, first and last names, and address. Later someone decides to add a date of birth, and with it the ability to calculate age, along with a bunch of other much-needed fields. This is an oversimplified example of one to one, but you get the idea: the main use for it is to extend your database without breaking existing code. Much code written (sadly) is tightly coupled to the database, so changes in the structure of a table will break the code. Adding a 1:1 like this extends the table to meet new requirements without modifying the original, thereby allowing old code to continue functioning normally and new code to make use of the new DB features.
The downside of normalisation, and of extending tables using 1:1 relationships in this way, is performance. Often, on heavily used systems, the first target for increasing database performance is de-normalising and combining such tables into a single table and optimising the indexes, thus removing the need to use joins and read from multiple tables. Normalisation / de-normalisation is neither a good nor a bad thing; it depends on the needs of the system. Most systems start off normalised and change back when needed, but this change needs to be done very carefully. As mentioned, if code is tightly coupled to the DB structure, combining tables will almost certainly cause the system to fail: when you combine two tables, one ceases to exist, and all the code that references the now-nonexistent table fails until it is modified. (In DB terms, imagine relationships connected to either of the tables in the 1:1; when you remove one of those tables, you break the relationships, and the structure has to be greatly modified to compensate. Unfortunately, such bad designs are much easier to spot in the DB world than in the software world; in code you usually don't notice something went wrong until it all falls apart, unless the system is properly designed with separation of concerns in mind.)
It's the closest thing you can get to inheritance in object-oriented programming, but it's not quite the same.
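A minimal sketch of that 1:1 extension idea, with made-up table and column names:

CREATE TABLE customers (
    customer_id INT AUTO_INCREMENT PRIMARY KEY,
    first_name  VARCHAR(50) NOT NULL,
    last_name   VARCHAR(50) NOT NULL,
    address     VARCHAR(255)
);

-- New requirements (date of birth, etc.) live in an extension table that
-- shares the customer's primary key, so existing code keeps working.
CREATE TABLE customer_details (
    customer_id   INT PRIMARY KEY,
    date_of_birth DATE,
    FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
);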
One to Many (1:M) / Many to One (M:1):
These two relationships (hence why four become three) are the most popular relationship types. They are both the same type of relationship; the only thing that changes is your point of view. An example: a customer has many phone numbers, or alternatively, many phone numbers can belong to a customer.
In object-oriented programming this would be considered composition. It's not inheritance; you are saying one item is composed of many parts. This is usually represented with arrays / lists / collections etc. inside of classes, as opposed to an inheritance structure.
Many to Many (M:M):
This type of relationship cannot be implemented directly with current technology. For this reason we need to break it down into two one-to-many relationships with an "association" table joining them. The many side of each of the two one-to-many relationships is always on the association / link table.
For your example, the person who said you need a many to many is correct, because a "two to many" is effectively a many (meaning more than one) to many relationship. This is the only way you would get your system to work, unless you intend to research the field of relational calculus to find some new type of relationship that would allow this.
Also, for such relationships (M:M) you have two choices: either create a compound key in the linker table so the combination of fields becomes a unique entry (if you are interested in DB optimisation this is the slower choice, but it takes less space), or create a third field with an auto-generated id column and make that the primary key (for DB optimisation, this is the faster choice, but it takes more space).
In your example specifically above...
A real world example would be phone numbers, users and companies. A company can have many phone numbers, a user can have many phone numbers, but ideally the user shouldn't provide the same phone number as the company as there would be duplicate content in the DB.
This would be a many to many relationship with the phone number table as the linker table between companies and users. As explained, to ensure no phone number is repeated, you simply set it as the primary key or use another primary key and set the phone number field to unique.
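A possible sketch of that setup, with illustrative names (this is one reading of the answer, not the poster's actual schema):

CREATE TABLE companies (
    company_id INT AUTO_INCREMENT PRIMARY KEY,
    name       VARCHAR(100) NOT NULL
);

CREATE TABLE users (
    user_id INT AUTO_INCREMENT PRIMARY KEY,
    name    VARCHAR(100) NOT NULL
);

-- The phone table sits between companies and users. Making the number
-- itself UNIQUE stops the same number being stored twice, whether it
-- belongs to a user or to a company.
CREATE TABLE phone_numbers (
    phone_id     INT AUTO_INCREMENT PRIMARY KEY,
    phone_number VARCHAR(20) NOT NULL UNIQUE,
    company_id   INT NULL,
    user_id      INT NULL,
    FOREIGN KEY (company_id) REFERENCES companies (company_id),
    FOREIGN KEY (user_id)    REFERENCES users (user_id)
);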
For these kinds of questions, it really comes down to how you phrase them. What is causing you to get confused, and how you overcome that confusion to see the solution, is simple: rephrase the problem as follows. Start by asking whether it is a one to one; if the answer is no, move on. Next ask whether it is a one to many; if the answer is no, move on. The only other option remaining is many to many. Be careful though: make sure you have considered the first two questions carefully before moving on. Many inexperienced database people over-complicate issues by defining a one to many as a many to many. Once again, the most popular type of relationship by far is one to many (I would say 90%), with many to many and one to one splitting the remaining 10% roughly 7/3 respectively. But those figures are just my personal perspective, so don't go quoting them as industry-standard statistics. My point is to make extra, extra sure it is definitely not a one to many before choosing many to many. It is worth the extra effort.
So now, to find the linker table between the two, decide which two are your main tables and what fields need to be shared between them. In this case, the company and user tables both need to share the phone number. Hence you need to make a new phone table as the linker.
The warning alarm of misunderstanding should go off as soon as you decide none of the three are working for you. This should be enough to tell you that you simply are not phrasing the relationship question correctly. You will get better at it as time passes, but it is an essential skill and really should be mastered as soon as possible, for your own sanity.
Of course you could also go to an object-oriented database, which allows a range of other relationships called "hierarchical" relationships. That's great if you are thinking of becoming a programmer too, but I wouldn't recommend it, as it is going to make your head hurt when you start finding ways to combine the various types of relationships. Especially given there is not much need, since nearly all databases in the world consist of just those three types of relationships, unless they are something super-duper special.
Hope this was a reasonable answer. Thanks for taking the time to read it.
Just make phone number a key in your contact numbers table.
For your phone number example, you would put the phone number in a table by itself, with an ID.
Then you link to that phone_id from each of users and companies.
For your parents example, you don't link the child to parent - instead you link the parent to the child. OR, you put both parents in the same table, and the child just links to one of them.

How do you know when you need separate tables?

How do you know when to create a new table for very similar object types?
Example:
To learn mysql I'm building a model solar system. For the purposes of my project, planets have many similar attributes to dwarf planets, centaurs, and comets. Dwarf planets are almost completely identical to planets. Centaurs and comets are only different from planets because their orbital path has more variation. Should I have a separate table for each type of object, or should they share tables?
The example is probably too simple, but I'm also interested in best practices. For example, should I use separate tables just in case I want to make planets and dwarf planets different in the future, or are there any efficiency reasons for keeping them in the same table?
Normal forms are what you should be interested in. They are pretty much the convention for building tables.
Any design that doesn't break the first, second or third normal form is fine by me. That's a pretty long list of requirements though, so I suggest you go read them off the Wikipedia links above.
It depends on what type of information you want to store about the objects. If the information for all of them is the same, say orbit radius, mass and name, then you can use the same table. However, if there are different properties for each (say atmosphere composition for planets, etc.) then you can either use separate tables for each (not very normalized) or have one table for basic properties like orbit, mass and name and a second table for just the properties that are unique to planets (and a similar table for comets, etc. if needed). All objects would be in the first table but only planets would be in the second table and linked through a foreign key to the first table.
It's called Database Normalization
There are many normal forms. By applying normalization you will go through the metadata (tables) and study the relationships between data more clearly. By using normalization techniques you will optimize the tables to prevent redundancy. This process will help you understand which entities to create based on the relationships between the different fields.
You should most likely split the data about a planet etc so that the shared (common) information is in another table.
E.g.
Common (table): Diameter, Mass
Planet (table): Population
Comet (table): Speed
Poor column choices, I know. Have the Planet and Comet tables link to the Common data with a key.
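A quick sketch of those tables in SQL (the surrogate id column is an addition of mine; the rest follows the placeholder columns above):

CREATE TABLE common (
    body_id  INT AUTO_INCREMENT PRIMARY KEY,
    diameter DECIMAL(12,2),
    mass     DECIMAL(20,4)
);

CREATE TABLE planet (
    body_id    INT PRIMARY KEY,
    population BIGINT,
    FOREIGN KEY (body_id) REFERENCES common (body_id)
);

CREATE TABLE comet (
    body_id INT PRIMARY KEY,
    speed   DECIMAL(12,2),
    FOREIGN KEY (body_id) REFERENCES common (body_id)
);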
This is definitely a subjective question. It sounds like you are already on the right lines of thinking. I would ask:
Do these objects share many attributes? If so, it's probably worth considering at the very least a base table to list them all in.
Does one object "extend" another - it has all the attributes of the other, plus some extras? If so, it might be worth adding another table with the extra attributes and a one-to-one mapping back to the base object.
Do both objects have many shared attributes and unshared attributes? If this is the case, maybe you need a single table plus a "data extension" system where each object can have a type or category that specifies any amount of extra attributes that may be associated with it.
Do the objects only share one or two attributes? In this case, they are probably dissimilar enough to separate into multiple tables.
You may also ask yourself how you are going to query the data. Will you ever want to get them all in the same list? It's always a good idea to combine data into tables with other data they will commonly be queried with. For example, an "attachments" table where the file can be an image or a video, instead of images and video tables, if you commonly want to query for all attachments. Don't split into multiple tables unless there is a really good reason.
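For example, a single-table version of that attachments idea might look like this (names are assumptions for illustration):

CREATE TABLE attachments (
    attachment_id INT AUTO_INCREMENT PRIMARY KEY,
    file_type     ENUM('image', 'video') NOT NULL,
    file_path     VARCHAR(255) NOT NULL,
    duration_sec  INT NULL  -- only meaningful for videos, NULL for images
);

-- One query returns every attachment, regardless of type.
SELECT attachment_id, file_type, file_path FROM attachments;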
If you will ever want to get planets and comets in one single query, they will pretty much have to be in the same table if you want the database to work efficiently. Inheritance should be handled inside your app itself :)
Here's my answer to a similar question, which I think applies here as well:
How do you store business activities in a SQL database?
There are many different ways to express inheritance in your relational model. For example, you can try to squish everything into one table and have a field that allows you to distinguish between the different types, or have one table for the shared attributes with relationships to a child table holding the specific attributes, etc. In either case you're still storing the same information. When going from a domain model to a relational model, this is what is called an impedance mismatch. Both choices have different trade-offs; for example, one table will be easier to query, but multiple tables will have higher data density.
In my experience it's best not to try to answer these questions from a database perspective, but let your domain model, and sometimes your application framework of choice, drive the table structure. Of course this isn't always a viable choice, especially when performance is concerned.
I recommend you start by drawing on paper the relationships you want to express and then go from there. Does the table structure you've chosen represent the domain accurately? Is it possible to query to extract the information you want to report on? Are the queries you've written complicated or slow? Answering these questions and others like them will hopefully guide you towards creating a good relational model.
I'd also suggest reading up on database normalization if you're serious about learning good relational modeling principles.
I'd probably have a table called [HeavenlyBodies] or some such thing, then have a lookup table with the type of body, i.e. planet, comet, asteroid, star, etc. All will share similar things such as name, size, and weight. Most of the answers I've read so far give good advice. Normalization is good, but I feel you can take it too far sometimes. Third normal form is a good goal.

First-time database design: am I overengineering? [closed]

Background
I'm a first year CS student and I work part time for my dad's small business. I don't have any experience in real world application development. I have written scripts in Python, some coursework in C, but nothing like this.
My dad has a small training business and currently all classes are scheduled, recorded and followed up via an external web application. There is an export/"reports" feature but it is very generic and we need specific reports. We don't have access to the actual database to run the queries. I've been asked to set up a custom reporting system.
My idea is to create the generic CSV exports and import (probably with Python) them into a MySQL database hosted in the office every night, from where I can run the specific queries that are needed. I don't have experience in databases but understand the very basics. I've read a little about database creation and normal forms.
We may start having international clients soon, so I want the database to not explode if/when that happens. We also currently have a couple big corporations as clients, with different divisions (e.g. ACME parent company, ACME healthcare division, ACME bodycare division)
The schema I have come up with is the following:
From the client perspective:
Clients is the main table
Clients are linked to the department they work for
Departments can be scattered around a country: HR in London, Marketing in Swansea, etc.
Departments are linked to the division of a company
Divisions are linked to the parent company
From the classes perspective:
Sessions is the main table
A teacher is linked to each session
A statusid is given to each session. E.g. 0 - Completed, 1 - Cancelled
Sessions are grouped into "packs" of an arbitrary size
Each pack is assigned to a client
I "designed" (more like scribbled) the schema on a piece of paper, trying to keep it normalised to third normal form. I then plugged it into MySQL Workbench and it made it all pretty for me. (Schema diagram not shown; source: maian.org.)
Example queries I'll be running
Which clients with credit still left are inactive (those without a class scheduled in the future)
What is the attendance rate per client/department/division (measured by the status id in each session)
How many classes has a teacher had in a month
Flag clients who have low attendance rate
Custom reports for HR departments with attendance rates of people in their division
Question(s)
Is this overengineered or am I headed the right way?
Will the need to join multiple tables for most queries result in a big performance hit?
I have added a 'lastsession' column to clients, as it is probably going to be a common query. Is this a good idea or should I keep the database strictly normalised?
Thanks for your time
Some more answers to your questions:
1) You're pretty much on target for someone who is approaching a problem like this for the first time. I think the pointers from others on this question thus far pretty much cover it. Good job!
2 & 3) The performance hit you will take will largely be dependent on having and optimizing the right indexes for your particular queries / procedures and more importantly the volume of records. Unless you are talking about well over a million records in your main tables you seem to be on track to having a sufficiently mainstream design that performance will not be an issue on reasonable hardware.
That said, and this relates to your question 3, with the start you have you probably shouldn't really be overly worried about performance or hyper-sensitivity to normalization orthodoxy here. This is a reporting server you are building, not a transaction based application backend, which would have a much different profile with respect to the importance of performance or normalization. A database backing a live signup and scheduling application has to be mindful of queries that take seconds to return data. Not only does a report server function have more tolerance for complex and lengthy queries, but the strategies to improve performance are much different.
For example, in a transaction based application environment your performance improvement options might include refactoring your stored procedures and table structures to the nth degree, or developing a caching strategy for small amounts of commonly requested data. In a reporting environment you can certainly do this but you can have an even greater impact on performance by introducing a snapshot mechanism where a scheduled process runs and stores pre-configured reports and your users access the snapshot data with no stress on your db tier on a per request basis.
All of this is a long-winded rant to illustrate that what design principles and tricks you employ may differ given the role of the db you're creating. I hope that's helpful.
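As a rough illustration of that snapshot idea (this assumes MySQL's event scheduler is enabled, that sessions can be tied directly to a client, and uses made-up table and column names based on the question's schema):

CREATE TABLE attendance_snapshot (
    client_id         INT NOT NULL,
    snapshot_date     DATE NOT NULL,
    total_sessions    INT NOT NULL,
    attended_sessions INT NOT NULL,
    PRIMARY KEY (client_id, snapshot_date)
);

-- A nightly job rebuilds the summary, so reports read pre-aggregated data
-- instead of hitting the raw tables on every request. statusid 0 means
-- "completed", per the question.
CREATE EVENT refresh_attendance_snapshot
ON SCHEDULE EVERY 1 DAY
DO
    REPLACE INTO attendance_snapshot (client_id, snapshot_date, total_sessions, attended_sessions)
    SELECT client_id, CURDATE(), COUNT(*), SUM(statusid = 0)
    FROM sessions
    GROUP BY client_id;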
You've got the right idea. You can however clean it up, and remove some of the mapping (has*) tables.
What you can do is in the Departments table, add CityId and DivisionId.
Besides that, I think everything is fine...
The only changes I would make are:
1- Change your VARCHAR to NVARCHAR; if you might be going international, you may want Unicode.
2- Change your int ids to GUIDs (uniqueidentifier) if possible (this might just be my personal preference). Assuming you eventually get to the point where you have multiple environments (dev/test/staging/prod), you may want to migrate data from one to the other. Having GUID ids makes this significantly easier.
3- Three layers for your Company -> Division -> Department structure may not be enough. Now, this might be over-engineering, but you could generalize that hierarchy so that you can support n levels of depth, as sketched after this list. This will make some of your queries more complex, so that may not be worth the trade-off. Further, it could be that any client who has more layers can easily be "stuffed" into this model.
4- You also have a Status in the Client Table that is a VARCHAR and has no link to the Statuses table. I'd expect a little more clarity there as to what the Client Status represents.
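To illustrate point 3, here is a hypothetical sketch of an n-level hierarchy: a single self-referencing table replaces Company -> Division -> Department, and a recursive CTE (MySQL 8.0+) walks the tree. All names are made up:

CREATE TABLE org_units (
    org_unit_id INT AUTO_INCREMENT PRIMARY KEY,
    parent_id   INT NULL,            -- NULL for a top-level company
    name        VARCHAR(100) NOT NULL,
    FOREIGN KEY (parent_id) REFERENCES org_units (org_unit_id)
);

-- Every unit underneath a given company (id 1), however deep the tree goes.
WITH RECURSIVE subtree AS (
    SELECT org_unit_id, parent_id, name
    FROM org_units
    WHERE org_unit_id = 1
    UNION ALL
    SELECT o.org_unit_id, o.parent_id, o.name
    FROM org_units AS o
    JOIN subtree AS s ON o.parent_id = s.org_unit_id
)
SELECT * FROM subtree;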
No. It looks like you're designing at a good level of detail.
I think that Countries and Companies are really the same entity in your design, as are Cities and Divisions. I'd get rid of the Countries and Cities tables (and Cities_Has_Departments) and, if necessary, add a boolean flag IsPublicSector to the Companies table (or a CompanyType column if there are more choices than simply Private Sector / Public Sector).
Also, I think there's an error in your usage of the Departments table. It looks like the Departments table serves as a reference to the various kinds of departments that each customer division can have. If so, it should be called DepartmentTypes. But your clients (who are, I assume, attendees) do not belong to a department TYPE, they belong to an actual department instance in a company. As it stands now, you will know that a given client belongs to an HR department somewhere, but not which one!
In other words, Clients should be linked to the table that you call Divisions_Has_Departments (but that I would call simply Departments). If this is so, then you must collapse Cities into Divisions as discussed above if you want to use standard referential integrity in the database.
By the way, it's worth noting that if you're generating CSVs already and want to load them into a MySQL database, LOAD DATA LOCAL INFILE is your best friend: http://dev.mysql.com/doc/refman/5.1/en/load-data.html . mysqlimport is also worth looking into; it's a command-line tool that's basically a nice wrapper around LOAD DATA INFILE.
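A hedged example of what that nightly import might look like, assuming a CSV export named sessions.csv whose columns line up with the target table (the path and names are placeholders):

LOAD DATA LOCAL INFILE '/path/to/sessions.csv'
INTO TABLE sessions
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;   -- skip the header row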
Most things have already been said, but I feel that I can add one thing: it is quite common for younger developers to worry about performance a little bit too much up-front, and your question about joining tables seems to go into that direction. This is a software development anti-pattern called 'Premature Optimization'. Try to banish that reflex from your mind :)
One more thing: Do you believe you really need the 'cities' and 'countries' tables? Wouldn't having a 'city' and 'country' column in the departments table suffice for your use cases? E.g. does your application need to list departments by city and cities by country?
Following comments based on role as a Business Intelligence/Reporting specialist and strategy/planning manager:
I agree with Larry's direction above. IMHO, it's not so much over-engineered; some things just look a little out of place. To keep it simple, I would tag the client directly to a Company ID, Department Description, Division Description, Department Type ID, and Division Type ID. Use Department Type ID and Division Type ID as references to lookup tables and internal reporting/analysis fields for long-term consistency.
The Packs table contains a "Credit" column; shouldn't that actually be tied to the Client base table, so that if they have many packs you can see how much credit is left for future classes? The application can take care of the calculation and store it centrally in the Client table.
Company info could use many more fields, including the obvious address/phone/etc. information. I'd also be prepared to add in D&B "DUNs" columns (Site/Branch/Ultimate) long term, Dun and Bradstreet (D&B) has a huge catalog of companies and you'll find later down the road their information is very helpful for reporting/analysis. This will take care of the multiple division issue you mention, and allow you to roll up their hierarchy for sub/division/branches/etc. of large corps.
You don't mention how many records you'll be working with, which could mean setting yourself up for a large development initiative that could have been done quicker, and with far fewer headaches, using prepackaged "reporting" software. If you're not dealing with a large database (< 65,000 rows), make sure MS Access, OpenOffice (Base) or related report/app dev solutions couldn't do the trick. I use Oracle's free APEX software quite a bit myself; it comes with their free database Oracle XE, just download it from their site.
FYI - Reporting insight: for large databases, you typically have two database instances: a) a transaction database for recording each detailed record, and b) a reporting database (data mart/data warehouse) housed on a separate machine. For more information, Google both Star Schema and Snowflake Schema.
Regards.
I want to address only the concern that joining multiple tables will cause a performance hit. Do not be afraid to normalize because you will have to do joins. Joins are normal and expected in relational databases, and they are designed to handle them well. You will need to set up PK/FK relationships (for data integrity, this is important to consider in the design), but in many databases FKs are not automatically indexed. Since they will be used in the joins, you will definitely want to start by indexing the FKs. PKs generally get an index on creation, as they have to be unique. It is true that data warehouse design reduces the number of joins, but usually one doesn't get to the point of data warehousing until one has millions of records that need to be accessed in one report. Even then, almost all data warehouses start with a transactional database to collect the data in real time, and the data is then moved to the warehouse on a schedule (nightly, monthly, or whatever the business need is). So this is a good start even if you need to design a data warehouse later to improve report performance.
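For example, explicitly indexing the foreign key columns that will appear in report joins might look like this (table and column names are assumptions based on the question's schema):

CREATE INDEX idx_sessions_pack_id ON sessions (pack_id);
CREATE INDEX idx_packs_client_id  ON packs (client_id);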
I must say your design is impressive for a first year CS student.
It isn't over-engineered; this is how I would approach the problem. Joining is fine, and there won't be much of a performance hit (it's completely necessary unless you de-normalise the database, which isn't recommended!). For statuses, see if you can use an ENUM datatype instead to optimise that table away.
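As a sketch of the ENUM suggestion, using the two statuses from the question (column and table names are assumptions):

-- Replaces the statusid column and its lookup table with a single ENUM column.
ALTER TABLE sessions
    ADD COLUMN status ENUM('completed', 'cancelled') NOT NULL DEFAULT 'completed';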
I've worked in the training / school domain and I thought I'd point out that there's generally a M:1 relationship between what you call "sessions" (instances of a given course) and the course itself. In other words, your catalog offers the course ("Spanish 101" or whatever), but you might have two different instances of it during a single semester (Tu-Th taught by Smith, Wed-Fri taught by Jones).
Other than that, it looks like a good start. I bet you'll find that the client domain (graphs leading to "clients") is more complex than you've modeled, but don't go overboard with that until you've got some real data to guide you.
A few things came to mind:
The tables seemed geared to reporting, but not really running the business. I would think when a client signs up, there's essentially an order being placed for the client attending a list of sessions, and that order might be for multiple employees in one company. It would seem an "order" table would really be at the center of your system and driving your data capture and eventual reporting. (Compare the paper documents you've been using to run the business with your database design to see if there's a logical match.)
Companies often don't have divisions. Employees sometimes change divisions/departments, maybe even mid-session. Companies sometimes add/delete/rename divisions/departments. Make sure that the possibly real-time changing contents of your tables don't make subsequent reporting/grouping difficult. With so much contact data split over so many tables, you might have to enforce very strict data entry validation to keep your reports meaningful and inclusive. E.g., when a new client is added, make sure his company/division/department/city match the same values as his coworkers'.
The "packs" concept isn't clear at all.
Since you indicate it's a small business, it would be surprising if performance would be an issue, considering the speed and capacity of current machines.