Spreading/distributing an entity into multiple tables instead of a single one - mysql

Why would anyone distribute an entity (for example user) into multiple tables by doing something like:
user(user_id, username)
user_tel(user_id, tel_no)
user_addr(user_id, addr)
user_details(user_id, details)
Is there any speed-up you get from this DB design? It's highly counter-intuitive, because performing chained joins to retrieve the data sounds immeasurably worse than a simple select with projection.
Of course, if one runs other queries that use only user_id and username, that's a speed-up, but is it worth it? So, where is the real advantage, and what would be a working scenario that fits such a DB design strategy?
LATER EDIT: for the purposes of this post, please assume a complete, unique entity whose attributes do not vary in quantity (e.g. a car has only one color, not two; a user has only one username, social security number, matriculation number, home address, email, etc.). That is, we're not dealing with a one-to-many relation, but with a 1-to-1, completely consistent description of an entity. In the example above, this is just the case where a single table has been "split" into as many tables as it had non-primary-key columns.
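For concreteness, here is roughly what retrieving one complete user looks like under the split design (a sketch reusing the table and column names above; LEFT JOINs are used in case some child rows are missing):
-- Reassembling the split entity requires a chain of joins
SELECT u.user_id, u.username, t.tel_no, a.addr, d.details
FROM user AS u
LEFT JOIN user_tel     AS t ON t.user_id = u.user_id
LEFT JOIN user_addr    AS a ON a.user_id = u.user_id
LEFT JOIN user_details AS d ON d.user_id = u.user_id
WHERE u.user_id = 42;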

By splitting the user in this way you have exactly 1 row in user per user, which links to 0-n rows each in user_tel, user_details, user_addr
This in turn means that these can be considered optional, and/or each user may have more than one telephone number linked to them. All in all it's a more adaptable solution than hardcoding it so that users always have up to 1 address, up to 1 telephone number.
The alternative method is to have e.g. user.telephone1, user.telephone2, etc.; however, this methodology goes against 3NF ( http://en.wikipedia.org/wiki/Third_normal_form ) - essentially you are introducing a lot of columns to store the same piece of information.
edit
Based on the additional edit from OP, assuming that each user will have precisely 0 or 1 of each tel, address, details, and NEVER any more, then storing those pieces of information in separate tables is overkill. It would be more sensible to store within a single user table with columns user_id, username, tel_no, addr, details.
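A sketch of that consolidated table (the column types are assumptions, since the OP did not specify any):
CREATE TABLE user (
  user_id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  username VARCHAR(64)  NOT NULL,
  tel_no   VARCHAR(20)  NULL,   -- NULL when the user has no phone number on file
  addr     VARCHAR(255) NULL,
  details  TEXT         NULL
);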
If memory serves this is perfectly fine within 3NF though. You stated this is not about normal form, however if each piece of data is considered directly related to that specific user then it is fine to have it within the table.
If you later expanded the table to have telephone1, telephone2 (for example), then that would violate 1NF. If you have duplicated data (e.g. multiple users sharing an address, which is entirely plausible), then that violates 2NF, which in turn violates 3NF.
This point about violating 2NF may well be why someone has done this.

The author of this design perhaps thought that storing NULLs could be achieved more efficiently in a "sparse" structure like this than it would be "in-line" in a single table. The idea was probably to store rows such as (1, "john", NULL, NULL, NULL) just as (1, "john") in the user table and no rows at all in the other tables. For this to work, NULLs must greatly outnumber non-NULLs (and must be "mixed" in just the right way), otherwise this design quickly becomes more expensive.
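To illustrate the idea (a sketch reusing the question's tables): a user with no phone, address or details costs only one narrow row, while a fully described user costs four rows:
-- user 1 has only a username: one narrow row in total
INSERT INTO user (user_id, username) VALUES (1, 'john');

-- user 2 has every attribute: one row in each of the four tables
INSERT INTO user         (user_id, username) VALUES (2, 'jane');
INSERT INTO user_tel     (user_id, tel_no)   VALUES (2, '555-0100');
INSERT INTO user_addr    (user_id, addr)     VALUES (2, '1 Main St');
INSERT INTO user_details (user_id, details)  VALUES (2, 'some details');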
Also, this could be somewhat beneficial if you'll constantly SELECT single columns. By splitting columns into separate tables, you are making them "narrower" from the storage perspective and lower the I/O in this specific case (but not in general).
The problems of this design, in my opinion, far outweigh these benefits.

Related

how to manage common information between multiple tables in databases

This is my first question on Stack Overflow. I am a full-stack developer working with the following stack: Java - Spring - Angular - MySQL. I am working on a side project and I have a database design question.
I have some information that is common between multiple tables, like:
Document information (can be used initially in the FOLDER and CONTRACT tables).
Type information (tables: COURT, FOLDER, OPPONENT, ...).
Status (tables: CONTRACT, FOLDER, ...).
Address (tables: OFFICE, CLIENT, OPPONENT, COURT, ...).
To avoid repetition and to avoid coupling the core tables with "technical" tables (information that can be used in many tables), I am thinking about merging the "technical" tables into one functional table. For example, we could have a generic DOCUMENT table with the following columns:
ID
TITLE
DESCRIPTION
CREATION_DATE
TYPE_DOCUMENT (FOLDER, CONTRACT, ...)
OBJECT_ID (Primary key of the TYPE_DOCUMENT Table)
OFFICE_ID
PATT_DATA
for example we can retrieve the information about a document with the following query:
SELECT * FROM DOCUMENT WHERE OFFICE_ID = "office 1 ID" AND TYPE_DOCUMENT = "CONTRACT" AND OBJECT_ID= "contract ID";
we can also use the following index to optimize the query:
CREATE INDEX idx_document_retrieve ON DOCUMENT (OFFICE_ID, TYPE_DOCUMENT, OBJECT_ID);
My questions are:
Is this a good design?
Is there a better way of implementing this design?
Should I just use a normal database design? For example, a folder can have many documents, so I create a folder_document table with folder_id as a foreign key, and do the same for all the tables.
Any suggestions or notes are very welcome, and thank you in advance for the help.
What you're describing sounds like you're trying to decide whether to denormalize and how much to denormalize.
The answer is: it depends on your queries. Denormalization makes it more convenient or more performant to do certain queries against your data, at the expense of making it harder or more inefficient to do other queries. It also makes it hard to keep the redundant data in sync.
So you should minimize denormalization, doing it only where it gives you a clear advantage in the queries you need to be optimal.
Normalizing optimizes for data relationships. This makes a database organization that is not optimized for any specific query, but is equally well suited to all your queries, and it also has the advantage of preventing data anomalies.
Denormalization optimizes for specific queries, but at the expense of other queries. It's up to you to know which of your queries you need to prioritize, and which of your queries can suffer.
If you can't decide which of your queries deserves priority, or you can't predict whether you will have other new queries in the future, then you should stick with a normalized design.
There's no way anyone on Stack Overflow can know your queries better than you do.
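As a purely hypothetical illustration of that trade-off, suppose orders carries a redundant copy of the customer's name:
-- Cheaper report, no join needed:
SELECT order_id, customer_name FROM orders;

-- ...but every rename must now touch two tables, or the copies drift apart:
UPDATE customers SET name = 'New Name' WHERE customer_id = 7;
UPDATE orders    SET customer_name = 'New Name' WHERE customer_id = 7;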
Case 1: status
"Status" is usually a single value. To make it readable, you might use ENUM. If you need further info about a status, there be a separate table with PRIMARY KEY(status) with other columns about the statuses.
Case 2: address
"Address" is bulky and possibly multiple columns. (However, since the components of an "address" is rarely needed by in WHERE or ORDER BY clauses, there is rarely a good reason to have it in any form other than TEXT and with embedded newlines.
However, "addressis usually implemented as several separate fields. In this case, a separate table is a good idea. It would have a columnid MEDIUMINT UNSIGNED AUTO_INCREMENT PRIMARY KEYand the various columns. Then, the other tables would simply refer to it with anaddress_idcolumn andJOIN` to that table when needed. This is clean and works well even if many tables have addresses.
One Caveat: When you need to change the address of some entity, be careful if you have de-dupped the addresses. It is probably better to always add a new address and waste the space for any no-longer-needed address.
Discussion
Those two cases (status and address) are perhaps the extremes. For each potentially common column, decide which makes more sense. As Bill points out, you really need to be thinking about the queries in order to get the schema 'right'. You must write the main queries before deciding on indexes other than the PRIMARY KEY. (So, I won't now address your question about an Index.)
Do not use a 4-byte INT for something that is small, mostly immutable, and easier to read (see the example after this list):
2-byte country_code (US, UK, JP, ...)
5-byte zip-code CHAR(5) CHARSET ascii; similar for 6-byte postal_code
1-byte ENUM('maybe', 'no', 'yes')
1-byte ENUM('not_specified', 'Male', 'Female', 'other'); this might not be good if you try to enumerate all the "others".
1-byte ENUM('folder', ...)
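A sketch of how such compact types might look together (the table and column names are purely illustrative):
CREATE TABLE client_profile (
  client_id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  country_code CHAR(2) CHARSET ascii NOT NULL,               -- 2 bytes: 'US', 'UK', 'JP', ...
  zip_code     CHAR(5) CHARSET ascii NULL,                   -- 5 bytes
  contact_ok   ENUM('maybe', 'no', 'yes') NOT NULL,          -- 1 byte
  doc_type     ENUM('folder', 'contract', 'court') NOT NULL  -- 1 byte
);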
Your "folder" vs "document" is an example of a one-to-many relationship. Yes, it is implemented by having doc_id in the table Folders.
"many-to-many" requires an extra table for connecting the two tables.
ENUM
Some will argue against ever using ENUM. In your situation, there is no way to ensure that each table uses the same definition of, for example, doc_type. It is easy to add a new option on the end of the list, but costly to otherwise rearrange an ENUM.
ID
id (or ID) is almost universally reserved (by convention) to mean the PRIMARY KEY of a table, and it is usually (but not necessarily) AUTO_INCREMENT. Please don't violate this convention. Notice in my example above, id was the PK of the Addresses table, but called address_id in the referring table. You can optionally make a FOREIGN KEY between the two tables.
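For example, the Addresses pattern described above might look like this (column names and sizes are assumptions):
CREATE TABLE Addresses (
  id          MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  street      VARCHAR(100) NOT NULL,
  city        VARCHAR(50)  NOT NULL,
  postal_code VARCHAR(10)  NULL
);

CREATE TABLE Offices (
  office_id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name       VARCHAR(100) NOT NULL,
  address_id MEDIUMINT UNSIGNED NOT NULL,
  FOREIGN KEY (address_id) REFERENCES Addresses(id)   -- optional, as noted above
);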

Which is better, using a central ID store or assigning IDs based on tables

In many ERP systems (locally) I have seen that databases (generally MySQL) use a central key store (resource identity). Why is that?
That is, one special table is maintained in the database for ID generation; its first cell holds a number (ID) that is assigned to the next inserted tuple (i.e. common ID generation for all the tables in the same database).
This table also records the details of the last inserted batch: when 5 tuples are inserted into table ABC and, let's say, the last ID in the batch is X, then an entry like ('ABC', X) is also inserted into the central key store.
Is there any significance to this architecture?
And also, where can I find case studies of common large-scale custom-built ERP systems?
If I understand this correctly, you are asking why someone would replace IDs that are unique only within a table
TABLE clients (id_client AUTO_INCREMENT, name, address)
TABLE products (id_product AUTO_INCREMENT, name, price)
TABLE orders (id_order AUTO_INCREMENT, id_client, date)
TABLE order_details (id_order_detail AUTO_INCREMENT, id_order, id_product, amount)
with global IDs that are unique within the whole database
TABLE objects (id AUTO_INCREMENT)
TABLE clients (id_object, name, address)
TABLE products (id_object, name, price)
TABLE orders (id_object, id_object_client, date)
TABLE order_details (id_object, id_object_order, id_object_product, amount)
(Of course you could still call these IDs id_product etc. rather than id_object. I only used the name id_object for clarification.)
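Before comparing the two, here is a sketch of how an insert might look under the second (global-ID) scheme, using MySQL's LAST_INSERT_ID():
-- Reserve the next global id, then use it in the specific table
INSERT INTO objects () VALUES ();
INSERT INTO clients (id_object, name, address)
VALUES (LAST_INSERT_ID(), 'ACME Corp', '1 Main St');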
The first approach is the common one. When inserting a new row into a table you get the next available ID for the table. If two sessions want to insert at the same time, one must wait briefly.
The second approach hence leads to sessions waiting for their turn everytime they want to insert data, no matter what table, as they all get their IDs from the objects table. The big advantage is that when exporting data, you have global references. Say you export orders and the recipient tells you: "We have problems with your order details 12345. There must be something wrong with your data". Wouldn't it be great, if you could tell them "12345 is not an order detail ID, but a product ID. Do you have problems importing the product or can you tell me an order detail ID this is about?" rather than looking at an order detail record for hours that happens to have the number 12345, while it had nothing to do with the issue, really?
That said, it might be a better choice to use the first approach and add a UUID to all tables you'd use for external communication. No fight for the next ID and still no mistaken IDs in communication :-)
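A sketch of that combination (column names are illustrative; the UUID column is used only for external communication):
CREATE TABLE orders (
  id_order   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- internal key, cheap joins
  order_uuid CHAR(36)     NOT NULL,                             -- external reference only
  id_client  INT UNSIGNED NOT NULL,
  order_date DATE         NOT NULL,
  UNIQUE KEY (order_uuid)
);

INSERT INTO orders (order_uuid, id_client, order_date)
VALUES (UUID(), 42, CURDATE());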
This is a common strategy used in data warehouses to track the batch number after a successful or failed data load. If the data load fails, you store something like 'ABC', 'Batch_num' and 'Error_Code' in the batch control table, so the rest of your loading logic can decide what to do with the failure and you can easily track the load; if you want to recheck, you can archive the data. These IDs are usually generated by a database sequence. In a word, it is mostly used for monitoring purposes.
You can refer to this link for more details.
There are several more techniques, each with pros and cons. But let me start by pointing out two techniques that hit a brick wall at some point when scaling up. Let's assume you have billions of items, probably scattered across multiple servers, either by sharding or other techniques.
Brick wall #1: UUIDs are handy because clients can create them without having to ask some central server for values. But UUIDs are very random. This means that, in most situations, each reference incurs a disk hit because the id is unlikely to be cached.
Brick wall #2: Ask a central server, which has an AUTO_INCREMENT under the covers, to dole out ids. I watched a social media site that did nothing but collect images for sharing crash because of this, in spite of there being a server whose sole purpose was to hand out numbers.
Solution #1:
Here's one (of several) solutions that avoids most problems: Have a central server that hands out 100 ids at a time. After a client uses up the 100 it has been given, it asks for a new batch. If the client crashes, some of the last 100 are "lost". Oh, well; no big deal.
That solution is upwards of 100 times as good as brick wall #2. And the ids are much less random than those for brick wall #1.
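One possible way to implement the hand-out-a-block idea in MySQL (an assumption on my part, not necessarily how any particular site did it) is a single-row counter table updated atomically:
CREATE TABLE id_allocator (
  next_id BIGINT UNSIGNED NOT NULL
);
INSERT INTO id_allocator VALUES (1);

-- Grab a block of 100 ids in one round trip; LAST_INSERT_ID(expr)
-- makes the new upper bound available to this session only.
UPDATE id_allocator SET next_id = LAST_INSERT_ID(next_id + 100);
SELECT LAST_INSERT_ID() - 100 AS block_start,
       LAST_INSERT_ID() - 1   AS block_end;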
Solution #2: Each client can generate its own 64-bit, semi-sequential ids. The number includes a version number, some of the clock, a dedup part, and the client-id. So it is roughly chronological (worldwide) and guaranteed to be unique, but it still has good locality of reference for items created at about the same time.
Note: My techniques can be adapted for use by individual tables or as an uber-number for all tables. That distinction may have been your 'real' question. (The other Answers address that.)
The downside to such a design is that it puts a tremendous load on the central table when inserting new data. It is a built-in bottleneck.
Some "advantages" are:
Any resource id that is found anywhere in the system can be readily identified, regardless of type.
If there is any text in the table (such as a name or description), then it is all centralized facilitating multi-lingual support.
Foreign key references can work across multiple types.
The third is not really an advantage, because it comes with a downside: the inability to specify a specific type for foreign key references.

MySQL: database structure choice - big data - duplicate data or bridging

We have a 90GB MySQL database with some very big tables (more than 100M rows). We know this is not the best DB engine but this is not something we can change at this point.
Planning for a serious refactoring (performance and standardization), we are thinking on several approaches on how to restructure our tables.
The data flow / storage is currently done in this way:
We have one table called articles, one connection table called article_authors and one table authors
One single author can have 1..n firstnames, 1..n lastnames, 1..n emails
Every author has a unique parent (unique_author), except if that author is the parent
The possible data query scenarios are as follows:
Get the author firstname, lastname and email for a given article
Get the unique authors.id for an author called John Smith
Get all articles from the author called John Smith
The current DB schema links articles and authors through the article_authors connection table (schema diagram omitted).
EDIT: The main problem with this structure is that we always duplicate similar given_names and last_names.
We are now hesitating between two different structures:
A large number of tables; data are split and connected with IDs, and there are no duplicates in the main tables (articles and authors). We are not sure how this will impact performance, as we would need several joins in order to retrieve data (example schema omitted).
Data is split among a reasonable number of tables, with duplicate entries in the article_authors table (author firstname, lastname and email alternatives) in order to reduce the number of tables and the application code complexity. One author could have 10 alternatives, so we would have 10 entries for the same author in the article_authors table (example schema omitted).
The current schema is probably the best. The middle table is a many-to-many mapping table, correct? That can be made more efficient by following the tips here: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table
Rewrite #1 smells like "over-normalization". A big waste.
Rewrite #2 has some merit. Let's talk about phone_number instead of last_name because it is rather common for a person to have multiple phone_numbers (home, work, mobile, fax), but unlikely to have multiple names. (Well, OK, there are pseudonyms for some authors).
It is not practical to put a bunch of phone numbers in a cell; it is much better to have a separate table of phone numbers linked back to whoever they belong to. This would be 1:many. (Ignore the case of two people sharing the same phone number -- due to sharing a house, or due to working at the same company. Let the number show up twice.)
I don't see why you want to split firstname and lastname. What is the "firstname" of "J. K. Rowling"? I suggest that it is not useful to split names into first and last.
A single author would have a unique "id". MEDIUMINT UNSIGNED AUTO_INCREMENT is good for such. "J. K. Rowling" and "JK Rowling" can both link to the same id.
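A sketch of that idea (hypothetical table and column names; it assumes each spelling maps to exactly one author):
CREATE TABLE authors (
  author_id      MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  canonical_name VARCHAR(100) NOT NULL
);

CREATE TABLE author_aliases (
  alias     VARCHAR(100) NOT NULL PRIMARY KEY,
  author_id MEDIUMINT UNSIGNED NOT NULL,
  FOREIGN KEY (author_id) REFERENCES authors(author_id)
);

-- 'J. K. Rowling' and 'JK Rowling' both resolve to the same author_id
SELECT author_id FROM author_aliases WHERE alias = 'JK Rowling';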
More
I think it is very important to have a unique id for each author. The id can then be used for linking to books, etc.
You have pointed out that it is challenging to map different spellings into a single id. I think this should be essentially a separate task with separate table(s). And it is this task that you are asking about.
That is, split the database, and split the tasks in your mind, into:
one set of tables containing stuff to help deduce the correct author_id from the inconsistent information provided from the outside.
one set of tables where author_id is known to be unique.
(It does not matter whether this is one versus two DATABASEs, in the MySQL sense.)
The mental split helps you focus on the two different tasks, plus it prevents some schema constraints and confusion. None of your proposed schemas does the clean split I am proposing.
Your main question seems to be about the first set of tables -- how to turn strings of text ("JK Rawling") into a specific id. At this point, the question is first about algorithms, and only secondly about the schema.
That is, the tables should be designed to support the algorithm, not to drive it. Furthermore, when a new provider comes along with some strange new text format, you may need to modify the schema - possibly adding a special table for that provider's data. So, don't worry about making the perfect schema this early in the game; plan on running ALTER TABLE and CREATE TABLE next month or even next year.
If a provider is consistent in spelling, then a table with (provider_id, full_author_name, author_id) is probably a good first cut. But that does not handle variations of spelling, new authors, and new providers. We are getting into gray areas where human intervention will quickly be needed. Even worse is the issue of two authors with the same name.
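A sketch of that first cut (names and types are assumptions):
CREATE TABLE provider_author_map (
  provider_id      INT UNSIGNED NOT NULL,
  full_author_name VARCHAR(200) NOT NULL,
  author_id        MEDIUMINT UNSIGNED NOT NULL,
  PRIMARY KEY (provider_id, full_author_name)
);

-- Fast path: an exact-spelling lookup for a known provider
SELECT author_id
FROM provider_author_map
WHERE provider_id = 3 AND full_author_name = 'J. K. Rowling';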
So, design the algorithm with the assumption that simple data is easily and efficiently available from a database. From that, the schema design will somewhat easily flow.
Another tip here... Some degree of "brute force" is OK for the hard-to-match cases. Most of the time, you can easily map name strings to author_id very efficiently.
It may be easier to fetch a hundred rows from a table, then massage them with your algorithm in your app code. (SQL is rather clumsy for algorithms.)
If you want to reduce size you could also think about splitting email addresses into two parts: 'jkrowling@' + 'gmail.com'. You could have a table where you store common email domains, but seeing that over-normalization is a concern...

Does it cause problems to have a table associated with multiple content types?

I have multiple content types, but they all share some similarities. I'm wondering when it is a problem to use the same table for a different content type? Is it ever a problem? If so, why?
Here's an example: I have five kinds of content, and they all have a title. So, can't I just use a 'title' table for all five content types?
Extending that example: a title is technically a name. People and places have names. Would it be bad to put all of my content titles, people names, and place names in a "name" table? Why separate into place_name, person_name, content_title?
I have different kinds of content. In the database, they seem very similar, but the application uses the content in different ways, producing different outputs. Do I need a new table for each content type because it has a different result with different kinds of dependencies, or should I just allow null values?
I wouldn't do that.
If there are multiple columns that are the same among multiple tables, you should indeed normalize these to 1 table.
An example of that would be several types of users, which all require different columns but share some characteristics (e.g. name, address, phone number, email address).
These could be normalized into 1 table, which is then referenced by all the other tables through a foreign key (see http://en.wikipedia.org/wiki/Database_normalization ).
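A sketch of that idea with hypothetical tables: a shared person table holds the common columns, and each type-specific table references it.
CREATE TABLE person (
  person_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name      VARCHAR(100) NOT NULL,
  address   VARCHAR(255) NULL,
  phone     VARCHAR(20)  NULL,
  email     VARCHAR(100) NULL
);

CREATE TABLE customer (
  customer_id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  person_id      INT UNSIGNED NOT NULL,
  loyalty_points INT NOT NULL DEFAULT 0,   -- a customer-specific column
  FOREIGN KEY (person_id) REFERENCES person(person_id)
);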
Your example only shows 1 common column, which is not worth normalizing. It would even reduce performance when fetching your data, because you would need to join 2 tables to get all the data, and one of them (the one with the titles) contains a lot of data you don't need, putting more strain on the server.
While normalization is a very good practice to avoid redundancy and ensure consistency, it can sometimes be bad for performance. For example, in a person table with columns like name, address and dob, it is not good performance-wise to store a picture in the same table. A picture can easily be about 1MB, while the remaining columns may not take more than 1K. Imagine how many blocks of data need to be read, even if you only want to list the name and address of people living in a certain city, if you keep everything in the same table.
If the contents vary in size and you may need to retrieve only certain types of content in the same query, the performance gain from storing them in separate tables easily outweighs the benefits of normalization.
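A sketch of the picture example (hypothetical names): the wide BLOB lives in its own one-to-one table, so scans of the narrow table stay cheap.
CREATE TABLE people (
  person_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name      VARCHAR(100) NOT NULL,
  address   VARCHAR(255) NULL,
  dob       DATE NULL
);

CREATE TABLE people_pictures (
  person_id INT UNSIGNED NOT NULL PRIMARY KEY,   -- PK doubles as FK: at most one picture per person
  picture   MEDIUMBLOB NOT NULL,
  FOREIGN KEY (person_id) REFERENCES people(person_id)
);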
To typify data in this way, it's best to use a table (e.g., name) and a sub-table (e.g., name_type), and then use an FK constraint. Use an FK constraint because InnoDB does not support column constraints, and the MyISAM engine is not suited for this (it is much less robust and feature-rich, and it should really only be used for performance).
This kind of normalization is fine, but it should be done with a free-format column type, like VARCHAR(40), rather than with ENUM. Use triggers to restrict the input so that it matches the types you want to support.

When is it a good idea to move columns off a main table into an auxiliary table?

Say I have a table like this:
create table users (
  user_id int not null auto_increment primary key,
  username varchar(50),
  joined_at datetime,
  bio text,
  favorite_color varchar(30),
  favorite_band varchar(50)
  -- ...
);
Say that over time, more and more columns -- like favorite_animal, favorite_city, etc. -- get added to this table.
Eventually, there are like 20 or more columns.
At this point, I'm feeling like I want to move columns to a separate user_profiles table, so I can do select * from users without returning a large number of usually irrelevant columns (like favorite_color). And when I do need to query by favorite_color, I can just do something like this:
select * from users inner join user_profiles using (user_id) where user_profiles.favorite_color = 'red';
Is moving columns off the main table into an "auxiliary" table a good idea?
Or is it better to keep all the columns in the users table, and always be explicit about the columns I want to return? E.g.
select user_id, username, last_logged_in_at, etc. etc. from users;
What performance considerations are involved here?
Don't use an auxiliary table if it's going to contain a collection of miscellaneous fields with no conceptual cohesion.
Do use a separate table if you can come up with a good conceptual grouping of a number of fields e.g. an Address table.
Of course, your application has its own performance and normalisation needs, and you should only apply this advice with proper respect to your own situation.
I would say that the best option is to have properly normalized tables, and also to only ask for the columns you need.
A user profile table might not be a bad idea, if it is structured well to provide data integrity and simple enhancement/modification later. Only you can truly know your requirements.
One thing that no one else has mentioned is that it is often a good idea to have an auxiliary table if the row size of the main table would get too large. Read about the row size limits of your specific database in the documentation. There are often performance benefits to having tables that are less wide, and to moving the fields you don't use as often off to a separate table. If you choose to create an auxiliary table with a one-to-one relationship, make sure to set up the PK/FK relationship to maintain data integrity, and set a unique index or constraint on the FK field to maintain the one-to-one relationship.
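A sketch of such an auxiliary table, reusing the column names from the question (making user_id both the primary key and the foreign key is one common way to enforce the one-to-one relationship):
CREATE TABLE user_profiles (
  user_id        INT NOT NULL PRIMARY KEY,   -- the PK is also the FK, enforcing one-to-one
  bio            TEXT,
  favorite_color VARCHAR(30),
  favorite_band  VARCHAR(50),
  FOREIGN KEY (user_id) REFERENCES users(user_id)
);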
And to go along with everyone else, I cannot stress too strongly how bad it is to ever use select * in production queries. You save a few seconds of development time and create a performance problem, as well as make the application less maintainable (yes, less maintainable: you should not blindly return things you don't want to show in the application but need in the database; with select * you will break insert statements that use selects, and show users things you don't want them to see).
Try not to get in the habit of using SELECT * FROM ... If your application becomes large, and you query the users table for different things in different parts of your application, then when you do add favorite_animal you are more likely to break some spot that uses SELECT *. Or at the least, that place is now getting unused fields that slows it down.
Select the data you need specifically. It self-documents to the next person exactly what you're trying to do with that code.
Don't de-normalize unless you have good reason to.
Adding a favorite column every other day, every time a user has a new favorite, is a maintenance headache at best. I would strongly consider creating a table to hold favorites values in your case. I'm pretty sure I wouldn't just keep adding a new column all the time.
The general guideline that applies to this (called normalization) is that tables are grouped by distinct entities/objects/concepts, and that each column (field) in a table should describe some aspect of that entity.
In your example, it seems that favorite_color describes (or belongs to) the user. Sometimes it is a good idea to move data to a second table: when it becomes clear that the data actually describes a second entity. For example: you start your database collecting user_id, name, email, and zip_code. Then at some point in time, the CEO decides he would also like to collect the street_address. At this point a new entity has been formed, and you could conceptually view your data as two tables:
user: userid, name, email
address: streetaddress, city, state, zip, userid (as a foreign key)
So, to sum it up: the real challenge is to decide what data describes the main entity of the table, and what, if any, other entity exists.
Here is a great example of normalization that helped me understand it better
When there is no other reason (e.g. the normal forms for databases), you should not do it. You don't save any space, as the data must still be stored; instead you waste more, as you need another index to access it.
It is always better (though may require more maintenance if schemas change) to fetch only the columns you need.
This will result in lower memory usage by both MySQL and your client application, and reduced query times as the amount of data transferred is reduced. You'll see a benefit whether this is over a network or not.
Here's a rule of thumb: if adding a column to an existing table would require making it nullable (after data has been migrated etc) then instead create a new table with all NOT NULL columns (with a foreign key reference to the original table, of course).
You should not rely on using SELECT * for a variety of reasons (google it).