How to manage common information between multiple tables in databases - MySQL

This is my first question on Stack Overflow. I am a full-stack developer working with the following stack: Java - Spring - Angular - MySQL. I am working on a side project and I have a database design question.
I have some information that is common to multiple tables, such as:
Document information (can be used initially in the FOLDER and CONTRACT tables).
Type information (tables: COURT, FOLDER, OPPONENT, ...).
Status (tables: CONTRACT, FOLDER, ...).
Address (tables: OFFICE, CLIENT, OPPONENT, COURT, ...).
To avoid repetition and to avoid coupling the core tables with "technical" tables (information that can be used in many tables), I am thinking about merging the "technical" tables into one functional table. For example, we could have a generic DOCUMENT table with the following columns:
ID
TITLE
DESCRIPTION
CREATION_DATE
TYPE_DOCUMENT (FOLDER, CONTRACT, ...)
OBJECT_ID (Primary key of the TYPE_DOCUMENT Table)
OFFICE_ID
PATT_DATA
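In DDL it could look something like this (the column types here are just an illustration on my part; PATT_DATA is a domain-specific column):

-- Sketch of the generic DOCUMENT table; types are assumptions
CREATE TABLE DOCUMENT (
    ID            BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    TITLE         VARCHAR(255) NOT NULL,
    DESCRIPTION   TEXT,
    CREATION_DATE DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    TYPE_DOCUMENT VARCHAR(20) NOT NULL,      -- 'FOLDER', 'CONTRACT', ...
    OBJECT_ID     BIGINT UNSIGNED NOT NULL,  -- PK of the row in the TYPE_DOCUMENT table
    OFFICE_ID     BIGINT UNSIGNED NOT NULL,
    PATT_DATA     TEXT                       -- type assumed
);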
For example, we can retrieve the information about a document with the following query:
SELECT * FROM DOCUMENT WHERE OFFICE_ID = 'office 1 ID' AND TYPE_DOCUMENT = 'CONTRACT' AND OBJECT_ID = 'contract ID';
We can also use the following index to optimize this query:
CREATE INDEX idx_document_retrieve ON DOCUMENT (OFFICE_ID, TYPE_DOCUMENT, OBJECT_ID);
My questions are:
Is this a good design?
Is there a better way of implementing this design?
Should I just use a normal database design? For example, a folder can have many documents, so I create a folder_document table with folder_id as a foreign key, and do the same for all the tables.
Any suggestions or notes are very welcome, and thank you in advance for the help.

What you're describing sounds like you're trying to decide whether to denormalize and how much to denormalize.
The answer is: it depends on your queries. Denormalization makes it more convenient or more performant to do certain queries against your data, at the expense of making it harder or more inefficient to do other queries. It also makes it hard to keep the redundant data in sync.
So you should minimize denormalization, applying it only where it gives a clear advantage in the queries you need to be fast.
Normalizing optimizes for data relationships. This makes a database organization that is not optimized for any specific query, but is equally well suited to all your queries, and it also has the advantage of preventing data anomalies.
Denormalization optimizes for specific queries, but at the expense of other queries. It's up to you to know which of your queries you need to prioritize, and which of your queries can suffer.
If you can't decide which of your queries deserves priority, or you can't predict whether you will have other new queries in the future, then you should stick with a normalized design.
There's no way anyone on Stack Overflow can know your queries better than you do.

Case 1: status
"Status" is usually a single value. To make it readable, you might use ENUM. If you need further info about a status, there be a separate table with PRIMARY KEY(status) with other columns about the statuses.
Case 2: address
"Address" is bulky and possibly multiple columns. (However, since the components of an "address" is rarely needed by in WHERE or ORDER BY clauses, there is rarely a good reason to have it in any form other than TEXT and with embedded newlines.
However, "addressis usually implemented as several separate fields. In this case, a separate table is a good idea. It would have a columnid MEDIUMINT UNSIGNED AUTO_INCREMENT PRIMARY KEYand the various columns. Then, the other tables would simply refer to it with anaddress_idcolumn andJOIN` to that table when needed. This is clean and works well even if many tables have addresses.
One caveat: when you need to change the address of some entity, be careful if you have de-duplicated the addresses; otherwise you may silently change the address of every entity sharing that row. It is probably better to always add a new address and waste the space of any no-longer-needed address.
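A sketch of the address table and one referring table (column names are invented):

CREATE TABLE addresses (
    id           MEDIUMINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    street       VARCHAR(100),
    city         VARCHAR(50),
    country_code CHAR(2) CHARACTER SET ascii,
    postal_code  VARCHAR(10)
);

CREATE TABLE client (
    id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name       VARCHAR(100) NOT NULL,
    address_id MEDIUMINT UNSIGNED,
    FOREIGN KEY (address_id) REFERENCES addresses(id)
);

-- Fetch a client together with its address when needed:
SELECT c.name, a.*
FROM client AS c
JOIN addresses AS a ON a.id = c.address_id
WHERE c.id = 123;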
Discussion
Those two cases (status and address) are perhaps the extremes. For each potentially common column, decide which approach makes more sense. As Bill points out, you really need to think about the queries in order to get the schema right. You must write the main queries before deciding on indexes other than the PRIMARY KEY. (So I won't address your question about an index now.)
Do not use a 4-byte INT for something that is small, mostly immutable, and easier to read:
2-byte country_code (US, UK, JP, ...)
5-byte zip-code CHAR(5) CHARSET ascii; similar for 6-byte postal_code
1-byte ENUM('maybe', 'no', 'yes')
1-byte ENUM('not_specified', 'Male', 'Female', 'other'); this might not be good if you try to enumerate all the "others".
1-byte ENUM('folder', ...)
Your "folder" vs "document" is an example of a one-to-many relationship. Yes, it is implemented by having doc_id in the table Folders.
"many-to-many" requires an extra table for connecting the two tables.
ENUM
Some will argue against ever using ENUM. In your situation, there is no way to ensure that each table uses the same definition of, for example, doc_type. It is easy to add a new option on the end of the list, but costly to otherwise rearrange an ENUM.
ID
id (or ID) is almost universally reserved (by convention) to mean the PRIMARY KEY of a table, and it is usually (but not necessarily) AUTO_INCREMENT. Please don't violate this convention. Notice in my example above, id was the PK of the Addresses table, but called address_id in the referring table. You can optionally make a FOREIGN KEY between the two tables.

Related

Would this style of relating of tables work?

Would the following relationships between the tables work out?
There are over 4000 rows for Airline Data, 150k rows for RAW DATA, and about 2000 rows for Airports.
I cannot create a primary key for RAW DATA because there are many repeated values.
http://i108.photobucket.com/albums/n32/lurker3345/ACCESSHELP-1.png
The relationships look fine. I assume many things -- for starters, that the data types match where they are linked. The diagram doesn't communicate much, and there could be many reasons why the schema shown is not optimal.
You certainly can create a PK for RAW DATA, and you had better because it is voluminous.
A common approach is to select multiple fields to serve as the key, because together they form a unique value. This is called a compound key. It's helpful (even essential) because it naturally ensures that the unique combination is not unintentionally duplicated. (In most situations you will want to make sure all key fields are set to not allow a zero-length or null entry.)
There is a simpler approach that serves many situations. Maybe you don't need this kind of data integrity, or you aren't sure yet what would make up a compound key, or you just want to get a provisional PK in place. Merely add an autonumber field and declare that as PK.
Some developers take that easy approach and accomplish data validation outside the table...and some ignore data validation needs, which can result in a disaster.
Once you have the PK declared, making sure the table has indexes on critical fields (in addition to the PK) is important for efficiency.
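In SQL terms (the question is about Access, where the surrogate key would be an AutoNumber field; the table and column names here are invented), the two approaches look like this:

-- Option 1: compound key over columns that are unique together
CREATE TABLE raw_data (
    airline_code CHAR(2) NOT NULL,
    airport_code CHAR(3) NOT NULL,
    reading_date DATE    NOT NULL,
    reading      DECIMAL(10,2),
    PRIMARY KEY (airline_code, airport_code, reading_date)
);

-- Option 2: a provisional surrogate key (auto-increment)
-- CREATE TABLE raw_data (
--     id           INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
--     airline_code CHAR(2),
--     airport_code CHAR(3),
--     reading_date DATE,
--     reading      DECIMAL(10,2)
-- );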
Really, before all else, do yourself a favor and rename all tables and fields so there are no spaces. While at it, rethink every name and aim for the most descriptive and standardized name possible. Access is cruel when it comes to renaming things later on. Avoiding spaces is a practice that will help you greatly further down the road.

To merge the table or not for performance/centralisation

I have been working on my database and the thought occurred to me that maybe it would be better to combine two of my tables to better organise the data and perhaps get performance benefits (or not?).
I have two tables that contain addresses, and their structure is identical: one table contains invoice addresses and the other contains delivery addresses.
What would be the implications of merging these together into one table simply called "addresses", and create a new column called addressTypeId? This new column references a new table that contains address types like delivery, invoice, home etc.
Is having them how they are now, separate, better for performance, since requests for the different types of addresses (delivery and invoice) use two tables as opposed to one table, which might mean delays when requesting address data?
By the way I am using INNODB.
If you are missing the appropriate indexes, then lookup performance will drop by a factor of two (if you are merging two equally sized tables). However, if you are missing indexes, you likely don't care about performance.
Lookup using a hashed index is constant-time. Lookup using a tree index is logarithmic, so the effect is small. Writes to a tree index are logarithmic as well, and writes to a hash map are amortized constant.
Don't suffer from premature optimization!
A good design is more important than peak performance. Address lookup is likely not your bottleneck. Bad code resulting from a bad database design far outweighs any benefits. If you make two tables, you are going to duplicate code, and code duplication is a maintenance nightmare.
Merge the tables. You will be thankful when you need to extend your application in the near future. You might want to add more address types. You might want to add common functionality to the addresses (formatting, for example). Your customers will not notice the extra millisecond from traversing one more level of a binary tree, but they will notice that you have a hard time adding an extra feature, and they will notice inconsistencies arising from code duplication.
You might even gain performance by merging the tables. While you might need to traverse an extra node in a tree, the tree might be more likely to be cached in memory and not need disk access. Disk access is expensive. You might reduce disk access by merging.
As @BenP.P.Tung already said, you don't need an extra table for an enumeration. Use an enumeration type.
If you just need to distinguish the address types, I suggest an ENUM column in this merged table. If the table already exists, you can add the new column as follows:
ALTER TABLE addresses ADD COLUMN addressTypes ENUM('delivery','invoice','home') DEFAULT NULL;
Or DEFAULT 'invoice', or whatever you think should be the default when you cannot get the required information.
You don't need to put in all the ENUM values at once. Add just what you need now, and add more values in the future, like this:
ALTER TABLE addresses CHANGE addressTypes addressTypes ENUM('delivery','invoice','home','office') DEFAULT NULL;
One table will work fine. If there is a performance concern, then add the address type column to the primary index at the start of the index. This will avoid any performance issues until you have a very large number of addresses.
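A sketch of what that looks like (assuming InnoDB, where the primary key is the clustered index, so rows of one address type end up stored together):

CREATE TABLE addresses (
    addressTypeId TINYINT UNSIGNED NOT NULL,
    id            INT UNSIGNED NOT NULL AUTO_INCREMENT,
    line1         VARCHAR(100),
    city          VARCHAR(50),
    PRIMARY KEY (addressTypeId, id),
    KEY (id)  -- InnoDB requires the AUTO_INCREMENT column to lead some index
);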
their structure is identical.
Are their constraints identical as well? [1]
If yes, merge the addresses.
If no, keep them separate.
Constraints are as much part of the table as are its fields.
Is having them how they are now, separate, better for performance, since requests for the different types of addresses (delivery and invoice) use two tables as opposed to one table, which might mean delays when requesting address data?
Do you query both kinds of addresses in the same way?
If yes, it shouldn't matter either way (assuming you indexed correctly).
If not, then different tables enable you to index or cluster your data differently.
Related posts:
Data modeling for Same tables with same columns
Two tables with same columns or one table with additional column?
[1] For example, are both delivery and invoice supposed to be able to reference (through foreign keys) the same address? Are PKs of addresses supposed to be unique for all addresses or just for addresses of a particular type? Are there any CHECKs that exist for one address type and not for the other? Etc, etc...

Spreading/distributing an entity into multiple tables instead of a single one

Why would anyone distribute an entity (for example user) into multiple tables by doing something like:
user(user_id, username)
user_tel(user_id, tel_no)
user_addr(user_id, addr)
user_details(user_id, details)
Is there any speed-up bonus you get from this DB design? It's highly counter-intuitive, because performing chained joins to retrieve data would seem immeasurably worse than using a simple select with a projection.
Of course, if one performs other queries by making use only of the user_id and username, that's a speed-up, but is it worth it? So, where is the real advantage and what could be a compatible working scenario that's fit for such a DB design strategy?
LATER EDIT: in the details of this post, please assume a complete, unique entity whose attributes do not vary in quantity (e.g. a car has only one color, not two; a user has only one username/social security number/matriculation number/home address/email/etc.). That is, we're not dealing with a one-to-many relation, but with a 1-to-1, completely consistent description of an entity. In the example above, this is just the case where a single table has been "split" into as many tables as non-primary-key columns it had.
By splitting the user in this way you have exactly 1 row in user per user, which links to 0-n rows each in user_tel, user_details, and user_addr.
This in turn means that these can be considered optional, and/or each user may have more than one telephone number linked to them. All in all it's a more adaptable solution than hardcoding it so that users always have up to 1 address, up to 1 telephone number.
The alternative method is to have, e.g., user.telephone1, user.telephone2, etc.; however, this methodology goes against 3NF (http://en.wikipedia.org/wiki/Third_normal_form) - essentially you are introducing a lot of columns to store the same piece of information.
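The split layout from the question in DDL form, for concreteness (the column types are assumptions):

CREATE TABLE user (
    user_id  INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    username VARCHAR(50) NOT NULL
);

CREATE TABLE user_tel (
    user_id INT UNSIGNED NOT NULL,
    tel_no  VARCHAR(20)  NOT NULL,
    FOREIGN KEY (user_id) REFERENCES user(user_id)
);
-- user_addr and user_details follow the same pattern;
-- each side table can hold 0..n rows per user.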
edit
Based on the additional edit from the OP, assuming that each user will have precisely 0 or 1 of each tel, address, and details, and NEVER any more, then storing those pieces of information in separate tables is overkill. It would be more sensible to store them within a single user table with columns user_id, username, tel_no, addr, details.
If memory serves this is perfectly fine within 3NF though. You stated this is not about normal form, however if each piece of data is considered directly related to that specific user then it is fine to have it within the table.
If you later expanded the table to have telephone1, telephone2 (for example) then that would violate 1NF. If you have duplicate fields (i.e. multiple users share an address, which is entirely plausible), then that violates 2NF which in turn violates 3NF
This point about violating 2NF may well be why someone has done this.
The author of this design perhaps thought that storing NULLs could be achieved more efficiently in the "sparse" structure like this, than it would "in-line" in the single table. The idea was probably to store rows such as (1 , "john", NULL, NULL, NULL) just as (1 , "john") in the user table and no rows at all in other tables. For this to work, NULLs must greatly outnumber non-NULLs (and must be "mixed" in just the right way), otherwise this design quickly becomes more expensive.
Also, this could be somewhat beneficial if you'll constantly SELECT single columns. By splitting columns into separate tables, you are making them "narrower" from the storage perspective and lowering the I/O in this specific case (but not in general).
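For example (a sketch), a query that only needs the telephone number touches only the narrow user_tel table and never reads the wide user row:

-- Reads only the narrow user_tel table
SELECT tel_no
FROM user_tel
WHERE user_id = 42;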
The problems of this design, in my opinion, far outweigh these benefits.

For storing people in MySQL (or any DB) - multiple tables or just one?

Our company has many different entities, but a good chunk of those database entities are people. So we have customers, and employees, and potential clients, and contractors, and providers and all of them have certain attributes in common, namely names and contact phone numbers.
I may have gone overboard with object-oriented thinking, but now I am looking at making one "Person" table that contains all of the people, with flags/subtables "extending" that model and adding role-based attributes to junction tables as necessary. If we grow to, say, 250.000 people (on MySQL and ISAM), will this so greatly impact performance that future DBAs will curse me forever? Our single most common search is on name/surname combinations.
For example, for a company like Salesforce, are Clients/Leads/Employees all in a centralised table with sub-views (for want of a better term), or are they separated into different tables?
Caveat: this question is to do with "we found it better to do this in the real world" as opposed to theoretical design. I like the above solution, and am confident that with views, proper sizing and accurate indexing, that performance won't suffer. I also feel that the above doesn't count as a MUCK, just a pretty big table.
One 'person' table is the most flexible, efficient, and trouble-free approach.
It will be easy for you to do limited searches - find all people with this last name and who are customers, for example. But you may also find you have to look up someone when you don't know what they are - that will be easiest when you have one 'person' table.
However, you must consider the possibility that one person is multiple things to you - a customer because they bought something, and a contractor because you hired them for a job. It would be better, therefore, to have a 'join' table that gives you a many-to-many relationship.
create table person_type (
person_id int unsigned,
person_type_id int unsigned,
date_started datetime,
date_ended datetime,
[ ... ]
)
(You'll want to add indexes and foreign keys, of course. person_id is a FK to 'person' table; 'person_type_id' is a FK to your reference table for all possible person types. I've added two date fields so you can establish when someone was what to you.)
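For example, to find everyone who is currently a customer (a hypothetical query; the name of the reference table and its columns are invented here):

SELECT p.*
FROM person AS p
JOIN person_type     AS pt ON pt.person_id      = p.person_id
JOIN person_type_ref AS r  ON r.person_type_id  = pt.person_type_id
WHERE r.type_name = 'customer'
  AND pt.date_ended IS NULL;   -- the role is still active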
Since you have many different "types" of Persons, in order to have a normalized design with proper Foreign Key constraints, it's better to use the supertype/subtype pattern: one Person table (with the attributes common to all) and many subtype tables (Employee, Contractor, Customer, etc.), all in a 1:1 relationship with the main Person table, each with the necessary details for its type of Person.
Check this answer by @Branko for an example: Many-to-Many but sourced from multiple tables
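A sketch of the supertype/subtype pattern (names and types are illustrative; the index supports the surname/name search mentioned in the question):

CREATE TABLE person (
    person_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    surname   VARCHAR(100) NOT NULL,
    name      VARCHAR(100) NOT NULL,
    phone     VARCHAR(20),
    INDEX idx_name (surname, name)
);

-- Each subtype table shares the PK of person (1:1)
CREATE TABLE employee (
    person_id INT UNSIGNED PRIMARY KEY,
    hire_date DATE,
    FOREIGN KEY (person_id) REFERENCES person(person_id)
);

CREATE TABLE customer (
    person_id      INT UNSIGNED PRIMARY KEY,
    account_number VARCHAR(20),
    FOREIGN KEY (person_id) REFERENCES person(person_id)
);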
250.000 records for a database is not very much. If you set your indexes appropriately you will never find any problems with that.
You should probably set a type for a user. Those types should be in a different table, so you can see what the type means (make it a TINYINT or similar). If you need additional fields per user type, you could indeed create a different table for that.
This approach sounds really good to me.
Theoretically, it would be possible to be a customer of the company you work for.
But if that's not the case here, then you could store people in different tables depending on their role.
However like Topener said, 250.000 isn't much. So I would personally feel safe to store every single person in one table.
And then have a column for each role (employee, customer, etc.)
Even if you end up with a one table solution (for core person attributes), you are going to want to abstract it with views and put on some constraints.
The last thing you want to do is send confidential information to clients which was only supposed to go to employees because someone didn't join correctly. Or an accidental cross join which results in income being doubled on a report (but only for particular clients which also had an employee linked somehow).
It really depends on how you want the layers to look and which components are going to access which layers and how.
Also, I would think you want to revisit your choice of MyISAM over InnoDB.

Foreign key column optionally contains NULL or ID. Is there a better design?

I'm working on a database that holds answers from a questionnaire for companies.
In the table that holds the bulk of the answers I have a column (i.e. techDir) that indicates whether there is a technical director. If the company has a director then it's populated with an ID referencing a "people" table; otherwise it holds NULL.
Another design that has come to mind is the "techDir" column holding a Boolean value, leaving the look-up in the "people" table to the software logic and adding a column in the "people" table indicating the role of the person.
Which of the two designs is better? Is there generally a better design that I have not thought of?
I would say that if there is a relatively small number of NULL values, then using NULLs is okay. However, if you find that most rows contain NULLs, then you might be better off deleting the techDir column and creating an intermediate table between the Answers table and the People table containing all technical directors, with one column referencing "Answers" and another referencing "People", as shown below.
This will get rid of all the NULL values and also allow for more flexibility. If there is only one technical director per answer, then simply make the column referencing the Answers table UNIQUE to create a one-to-one relationship. If you need more than one technical director, create a one-to-many relationship as shown. Another advantage of this design is that it simplifies the query if you ever want to extract all the technical directors.
I generally use a simple rule of thumb when deciding whether to use NULL values or not: if a table contains lots of NULLs, I remove those columns and create a new table where I can store that data. You should of course also consider the types of queries you will be executing. For example, the design above might require an inner or outer join to view all the rows including the technical directors. As a developer, you should carefully weigh up the pros and cons and look at things like flexibility, speed, complexity, and your business rules when making these decisions.
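A sketch of that intermediate table (the table and column names are assumptions):

CREATE TABLE answer_tech_director (
    answer_id INT UNSIGNED NOT NULL,
    person_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (answer_id, person_id),  -- allows several directors per answer
    FOREIGN KEY (answer_id) REFERENCES answers(id),
    FOREIGN KEY (person_id) REFERENCES people(id)
);
-- For a strict one-to-one relationship, make answer_id alone
-- the primary key (or add UNIQUE (answer_id)).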
Logically, if there is no director, there should be NULL.
In business logic, you would have a reference to a Director object there; if there is no director, there should also be null instead of the reference.
Using a boolean for fear of additional performance loss due to longer query time looks very much like premature optimisation.
Also, there are joins that are optimized to do this efficiently in one query, so no additional lookups are necessary.
You could argue that it depends on how many entries have a director, so you could save a little space when only 1 in a million entries has one, depending on the datatype you use. But in the end, the clearest (and best) option is indeed to make a foreign key that allows NULL, as you proposed in the first option.
I think the NULL for that column is okay. As far as I remember from my DB class at uni (a long time ago), NULL is an excellent choice to represent "I don't know" or "it doesn't have one".
I think the second design has the following flaw: you didn't mention how to look up the techDir of a specific question; you said that you just tag the person. Another problem might be that if in the future you add another role, the schema won't support it.
NULL is the most common way of indicating no relationship in an optional relationship.
There is an alternative. Decompose the table into two tables, one of which has two foreign keys: back to the original table and forward to the related table. In cases where there is no relationship, just omit the entire row.
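A sketch of that decomposition (names are invented); answers with no technical director simply have no row in the link table:

CREATE TABLE answer_director (
    answer_id INT UNSIGNED PRIMARY KEY,  -- back to the original table
    person_id INT UNSIGNED NOT NULL,     -- forward to the related table
    FOREIGN KEY (answer_id) REFERENCES answers(id),
    FOREIGN KEY (person_id) REFERENCES people(id)
);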
If you want to understand this in terms of normalization, look up "Sixth Normal form" (6NF). Not all experts are in agreement about 6NF.