Naming Conventions for Multivariable Dependency Tables MySQL - mysql

Conventions for normalized databases hold that the best practice for dealing with multivariable dependencies is to spin them off into their own table with two columns. One column is the primary key of the original table (for example, customer name, of which there is one), while the other is the attribute that can have multiple values (for example, email or phone; a customer could have several of these). Together these two columns constitute the primary key of the spun-off table.
However, when building normalized databases, I often find naming these spun-off tables troublesome. It's hard to come up with meaningful names for them. Is there a standard way of identifying these tables as multivariable dependency tables that are meaningless without the presence of the other table? Some examples I can think of (referencing the example above) are 'customer_phones' or 'customer_has_phones'. I don't think just 'phones' would be good, because that doesn't identify this table as related to and heavily dependent on the customers table.

In real life you end up running into a lot of combinations that vary a lot from each other.
Try to be as clear as possible in case someone else ends up inheriting your design. I personally like to keep the names of the parent tables short, so the names of the linked tables don't end up being super long whenever the relationship grows or spins off new children.
For instance, if I have "Customer", "Subscriptions", "Product" tables I would end up naming their links like "Customer_Subscriptions" or "Subscriptions_Products" and such.
Most of the time it just gets down to what works better for you in terms of maintainability.

The convention we use is the name of the entity table, followed by the name of the attribute.
In your example, if the entity table is customer, the name of the table for the repeating (multi-valued) attribute would be customer_phone or customer_phone_number. (We almost always name tables in the singular, based on the idea that we are naming what ONE tuple (row) represents, e.g. a row in that table represents one occurrence of a phone number for a customer.)
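A minimal sketch of that convention, run against an in-memory SQLite database through Python's sqlite3 module so that it is self-contained (the MySQL DDL would differ slightly in its types). The surrogate customer_id key, the sample rows and the add_phone helper are assumptions made for illustration; the question itself uses the customer name as the key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked to
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
-- entity name + attribute name, in the singular:
-- one row is one phone number of one customer.
CREATE TABLE customer_phone (
    customer_id  INTEGER NOT NULL REFERENCES customer(customer_id),
    phone_number TEXT    NOT NULL,
    PRIMARY KEY (customer_id, phone_number)
);
""")
conn.execute("INSERT INTO customer VALUES (1, 'Alice')")
conn.executemany(
    "INSERT INTO customer_phone VALUES (?, ?)",
    [(1, "555-0100"), (1, "555-0101")],
)


def add_phone(customer_id, phone_number):
    """Return True if the row was stored, False if a constraint rejected it."""
    try:
        conn.execute(
            "INSERT INTO customer_phone VALUES (?, ?)",
            (customer_id, phone_number),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

The composite primary key rejects a repeated (customer, phone) pair, and the foreign key is what makes the table meaningless without the customer table.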

Related

MySQL: database structure choice - big data - duplicate data or bridging

We have a 90GB MySQL database with some very big tables (more than 100M rows). We know this is not the best DB engine but this is not something we can change at this point.
We are planning a serious refactoring (performance and standardization) and are considering several approaches to restructuring our tables.
The data flow / storage is currently done in this way:
We have one table called articles, one connection table called article_authors and one table authors
One single author can have 1..n firstnames, 1..n lastnames, 1..n emails
Every author has a unique parent (unique_author), except if that author is the parent
The possible data query scenarios are as follows:
Get the author firstname, lastname and email for a given article
Get the unique authors.id for an author called John Smith
Get all articles from the author called John Smith
The current DB schema looks like this:
EDIT: The main problem with this structure is that we always duplicate similar given_names and last_names.
We are now hesitating between two different structures:
Large number of tables, data are split and there are connections with IDs. No duplicates in the main tables: articles and authors. Not sure how this will impact the performance as we would need to use several joins in order to retrieve data, example:
Data is split among a reasonable number of tables with duplicate entries in the table article_authors (author firstname, lastname and email alternatives) in order to reduce the number of tables and the application code complexity. One author could have 10 alternatives, so we will have 10 entries for the same author in the article_authors table:
The current schema is probably the best. The middle table is a many-to-many mapping table, correct? That can be made more efficient by following the tips here: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table
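For reference, a common shape for such a many-to-many mapping table is a composite primary key in one direction plus a secondary index in the other. The sketch below uses SQLite through Python's sqlite3 module so it can be run as-is; the column names, the index name and the sample rows are assumptions, since the original schema image is not available.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE articles (id INTEGER PRIMARY KEY, title     TEXT NOT NULL);
CREATE TABLE authors  (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL);

-- Mapping table: no surrogate id, the pair itself is the key.
CREATE TABLE article_authors (
    article_id INTEGER NOT NULL REFERENCES articles(id),
    author_id  INTEGER NOT NULL REFERENCES authors(id),
    PRIMARY KEY (article_id, author_id)
);
-- Second index so lookups starting from the author are also index-only.
CREATE INDEX article_authors_by_author
    ON article_authors (author_id, article_id);

INSERT INTO articles VALUES (1, 'First article'), (2, 'Second article');
INSERT INTO authors  VALUES (1, 'John Smith'), (2, 'Jane Doe');
INSERT INTO article_authors VALUES (1, 1), (2, 1), (2, 2);
""")


def articles_by(full_name):
    """All article titles for the author with this exact name."""
    rows = conn.execute(
        """SELECT a.title
             FROM authors au
             JOIN article_authors aa ON aa.author_id = au.id
             JOIN articles a         ON a.id = aa.article_id
            WHERE au.full_name = ?
            ORDER BY a.id""",
        (full_name,),
    )
    return [title for (title,) in rows]


def authors_of(article_id):
    """All author names for one article."""
    rows = conn.execute(
        """SELECT au.full_name
             FROM article_authors aa
             JOIN authors au ON au.id = aa.author_id
            WHERE aa.article_id = ?
            ORDER BY au.id""",
        (article_id,),
    )
    return [name for (name,) in rows]
```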
Rewrite #1 smells like "over-normalization". A big waste.
Rewrite #2 has some merit. Let's talk about phone_number instead of last_name because it is rather common for a person to have multiple phone_numbers (home, work, mobile, fax), but unlikely to have multiple names. (Well, OK, there are pseudonyms for some authors).
It is not practical to put a bunch of phone numbers in a cell; it is much better to have a separate table of phone numbers linked back to whoever they belong to. This would be 1:many. (Ignore the case of two people sharing the same phone number -- due to sharing a house, or due to working at the same company. Let the number show up twice.)
I don't see why you want to split firstname and lastname. What is the "firstname" of "J. K. Rowling"? I suggest that it is not useful to split names into first and last.
A single author would have a unique "id". MEDIUMINT UNSIGNED AUTO_INCREMENT is good for such. "J. K. Rowling" and "JK Rowling" can both link to the same id.
More
I think it is very important to have a unique id for each author. The id can be then used for linking to books, etc.
You have pointed out that it is challenging to map different spellings into a single id. I think this should be essentially a separate task with separate table(s). And it is this task that you are asking about.
That is, split the database, and split the tasks in your mind, into:
one set of tables containing stuff to help deduce the correct author_id from the inconsistent information provided from the outside.
one set of tables where author_id is known to be unique.
(It does not matter whether this is one versus two DATABASEs, in the MySQL sense.)
The mental split helps you focus on the two different tasks, plus it prevents some schema constraints and confusion. None of your proposed schemas does the clean split I am proposing.
Your main question seems to be about the first set of tables -- how do you turn strings of text ("JK Rawling") into a specific id? At this point, the question is first about algorithms, and only secondly about the schema.
That is, the tables should be designed to support the algorithm, not to drive it. Furthermore, when a new provider comes along with some strange new text format, you may need to modify the schema - possibly adding a special table for that provider's data. So, don't worry about making the perfect schema this early in the game; plan on running ALTER TABLE and CREATE TABLE next month or even next year.
If a provider is consistent in spelling, then a table with (provider_id, full_author_name, author_id) is probably a good first cut. But that does not handle variations of spelling, new authors, and new providers. We are getting into gray areas where human intervention will quickly be needed. Even worse is the issue of two authors with the same name.
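A rough sketch of that first-cut table, using SQLite through Python's sqlite3 module so it is runnable on its own. The table name provider_author_name, the provider ids and the resolve_author helper are hypothetical; only the three columns come from the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The "clean" side: one row per real author.
CREATE TABLE author (
    author_id      INTEGER PRIMARY KEY,
    canonical_name TEXT NOT NULL
);
-- The "messy" side: whatever spelling each provider sends us.
CREATE TABLE provider_author_name (
    provider_id      INTEGER NOT NULL,
    full_author_name TEXT    NOT NULL,
    author_id        INTEGER NOT NULL REFERENCES author(author_id),
    PRIMARY KEY (provider_id, full_author_name)
);
""")
conn.execute("INSERT INTO author VALUES (1, 'J. K. Rowling')")
conn.executemany(
    "INSERT INTO provider_author_name VALUES (?, ?, ?)",
    [(10, "J. K. Rowling", 1), (20, "JK Rowling", 1)],
)


def resolve_author(provider_id, full_author_name):
    """Exact lookup; returns the author_id, or None when the string is unknown."""
    row = conn.execute(
        """SELECT author_id
             FROM provider_author_name
            WHERE provider_id = ? AND full_author_name = ?""",
        (provider_id, full_author_name),
    ).fetchone()
    return row[0] if row else None
```

An unknown spelling simply comes back as None here, which is the point where the fuzzier matching (or a human) has to take over.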
So, design the algorithm with the assumption that simple data is easily and efficiently available from a database. From that, the schema design will somewhat easily flow.
Another tip here... Some degree of "brute force" is OK for the hard-to-match cases. Most of the time, you can easily map name strings to author_id very efficiently.
It may be easier to fetch a hundred rows from a table, then massage them in your algorithm in your app code. (SQL is rather clumsy for algorithms.)
If you want to reduce size, you could also think about splitting email addresses into two parts: 'jkrowling@' + 'gmail.com'. You could have a table where you store common email domains, but seeing that over-normalization is a concern...
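As an illustration of doing the massaging in app code, here is a small sketch that takes candidate rows (as if already fetched from the database) and picks the closest spelling with Python's standard difflib module. The candidate rows, the guess_author_id name and the 0.8 cutoff are assumptions for the example, not a recommendation of a particular matching algorithm.

```python
import difflib

# Hypothetical rows, as if fetched with something like
#   SELECT full_author_name, author_id FROM ... LIMIT 100
candidate_rows = [
    ("J. K. Rowling", 1),
    ("JK Rowling", 1),
    ("John Smith", 2),
    ("Jane Smyth", 3),
]


def guess_author_id(raw_name, rows, cutoff=0.8):
    """Return the author_id of the closest known spelling, or None.

    A None result is the case that needs another strategy or a human.
    """
    names = [name for name, _ in rows]
    matches = difflib.get_close_matches(raw_name, names, n=1, cutoff=cutoff)
    if not matches:
        return None
    return dict(rows)[matches[0]]
```

The cutoff decides how much brute force you tolerate: lower values match more misspellings but also risk merging two different authors.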

Proper design for a table with subset of columns null per row

I will start my question with the abstract case and then I will also give a concrete example in case it helps.
Assuming I have a tableX with columns A,B,C,D,E,F.
A,B,F are required.
Now we can have a record with C,D populated (so E is null) or a record with E populated (so C,D are null).
Is this table normalized or properly designed? I am not sure whether the relations/expectations among these columns, as I described them, should be "captured" differently.
Example:
A table to be used by a message processor where either the actual msg to get/process is stored in column E OR the url and the protocol to fetch the message to process are stored in columns C and D
Normally tables that store a class hierarchy (super class and sub-classes together) require a separate discriminator column. In your case each of the three columns - C, D or E - can be used as such, so an additional column is not required.
Such data organization offers best performance for simple queries.
If you split it into 3 separate tables (super class and its two sub-classes) you will get a normalized model. I believe in your case it does not make sense, as long as you have just these three nullable columns.
If your example is a simplified presentation of your real data model and your sub-classes differ substantially, then normalization will be more economical in storage space and offer faster execution for queries that rely solely on super class data.
The table is probably not properly normalized. It sounds like there are two types of entities being stored in the table -- the A,B,C,D,F entity and the A,B,E,F entity.
Does this make the schema bad? Probably not. Relational databases use primary keys to connect one table to another. If other tables can connect to either type of entity, then it makes sense to store them in a single table. This allows one single key to connect them. You could, of course, introduce a three table schema (one for each subentity and one for the parent entity). This could be overkill when the entities are really quite similar.
Your example is a fine example. This sounds like a control table for a process that can do one of two things. It makes sense that different columns are used for each type processing.
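If you keep the single table, the "either C and D, or E" rule can still be captured in the schema with a CHECK constraint. This is one possible approach rather than something the answers above prescribe. The sketch uses SQLite through Python's sqlite3 module; the table name message_job and the try_insert helper are hypothetical. MySQL enforces CHECK constraints only from version 8.0.16 on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE message_job (
    id       INTEGER PRIMARY KEY,
    a        TEXT NOT NULL,
    b        TEXT NOT NULL,
    url      TEXT,            -- column C
    protocol TEXT,            -- column D
    msg      TEXT,            -- column E
    f        TEXT NOT NULL,
    -- Exactly one of the two shapes is allowed per row.
    CHECK (
        (msg IS NOT NULL AND url IS NULL     AND protocol IS NULL)
        OR
        (msg IS NULL     AND url IS NOT NULL AND protocol IS NOT NULL)
    )
)
""")


def try_insert(url, protocol, msg):
    """Return True if the row satisfies the constraint, False otherwise."""
    try:
        conn.execute(
            "INSERT INTO message_job (a, b, url, protocol, msg, f) "
            "VALUES ('a', 'b', ?, ?, ?, 'f')",
            (url, protocol, msg),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```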

MySQL - When to have one to one relationships

When should one use one to one relationships? When should you add new fields and when should you separate them into a new table?
It seems to me that you'd use it whenever you're grouping fields and/or that group tends to be optional. Yes?
I'm trying to create the tables for an object, but grouping/separating everything would require about 20 joins, some of them 4 levels deep.
Am I doing something wrong? How can I improve?
First, I highly recommend reading about Normal Forms.
A normalized relational database is extremely useful, and doing this properly is the reason tools such as Hibernate exist - to help manage the difference between objects-represented-as-relational-mappings and objects-as-programmatic-entities.
Anything that has a one-to-one mapping should probably be in the same table. A Person has only one first name, one last name. Those should logically be in the same table. Having a reference to a table of names isn't necessary - in particular because little additional data can be stored about a name. Obviously, this isn't always true (an etymology database might want to do exactly that), but for most uses, you don't care about where a name comes from - indeed all you want is the name.
Therefore, think of the objects being represented. A person has some singular data points, and some one-to-many relationships (addresses they have lived, for instance). One to many and many to many will almost always require a separate table (or two, to have many to many). Following those two guidelines, you can get a normalized database pretty fast.
Note that optional fields should be avoided if at all possible. Usually this is a case of having a separate table holding the field with a reference back to the original table. Try to keep your tables lean. If a field isn't likely to have something, it probably should be a row in its own table. Many such properties suggest a 'Property' table that can hold arbitrary optional properties of a particular type (i.e., as applied to a 'Person').
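A small sketch of that split, using SQLite through Python's sqlite3 module: the one-to-one fields stay on the person row, and optional properties become rows in a separate table. The table name person_property, its columns and the sample data are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked to
conn.executescript("""
-- One-to-one data lives directly on the row.
CREATE TABLE person (
    person_id  INTEGER PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name  TEXT NOT NULL
);
-- Optional data: a missing property is a missing row, not a NULL column.
CREATE TABLE person_property (
    person_id INTEGER NOT NULL REFERENCES person(person_id),
    name      TEXT    NOT NULL,
    value     TEXT    NOT NULL,
    PRIMARY KEY (person_id, name)
);
""")
conn.execute("INSERT INTO person VALUES (1, 'Ann', 'Example')")
conn.execute("INSERT INTO person VALUES (2, 'Bob', 'Sample')")
conn.execute("INSERT INTO person_property VALUES (1, 'nickname', 'Annie')")


def properties_of(person_id):
    """Return the optional properties of one person as a dict."""
    rows = conn.execute(
        "SELECT name, value FROM person_property WHERE person_id = ? ORDER BY name",
        (person_id,),
    )
    return dict(rows.fetchall())
```

The trade-off is that every value in such a property table is stored as text, so type checking moves into the application.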

Does it cause problems to have a table associated with multiple content types?

I have multiple content types, but they all share some similarities. I'm wondering when it is a problem to use the same table for a different content type? Is it ever a problem? If so, why?
Here's an example: I have five kinds of content, and they all have a title. So, can't I just use a 'title' table for all five content types?
Extending that example: a title is technically a name. People and places have names. Would it be bad to put all of my content titles, people names, and place names in a "name" table? Why separate into place_name, person_name, content_title?
I have different kinds of content. In the database, they seem very similar, but the application uses the content in different ways, producing different outputs. Do I need a new table for each content type because it has a different result with different kinds of dependencies, or should I just allow null values?
I wouldn't do that.
If there are multiple columns that are the same among multiple tables, you should indeed normalize these to 1 table.
An example of that would be several types of users, which all require different columns but share some characteristics (e.g. name, address, phone number, email address).
These could be normalized to 1 table, which is then referenced to by all other tables through a foreign key. (see http://en.wikipedia.org/wiki/Database_normalization )
Your example only shows 1 common column, which is not worth normalizing. It would even reduce performance when fetching your data, because you'll need to join 2 tables to get all the data, one of which (the one with the titles) contains many rows you won't need, thus straining the server more.
While normalization is a very good practice to avoid redundancy and ensure consistency, it can sometimes be bad for performance. For example, for a person table with columns like name, address and dob, it is not good performance-wise to have a picture in the same table. A picture can easily be about 1MB, while the remaining columns may not take more than 1K. Imagine how many blocks of data need to be read even if you only want to list the name and address of people living in a certain city, if you are keeping everything in the same table.
If the contents vary in size and you might have to retrieve only certain types of content in the same query, the performance gain from storing them in separate tables will easily outweigh the benefits of normalization.
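A sketch of the picture example, using SQLite through Python's sqlite3 module: the large column is moved into its own one-to-one table, so listing names and addresses never touches the picture data. The table name person_picture, the city column and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Small, frequently listed columns.
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    address   TEXT NOT NULL,
    city      TEXT NOT NULL
);
-- Large, rarely needed column, at most one row per person.
CREATE TABLE person_picture (
    person_id INTEGER PRIMARY KEY REFERENCES person(person_id),
    picture   BLOB NOT NULL
);
""")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?, ?)",
    [
        (1, "Ann Example", "1 Main St", "Springfield"),
        (2, "Bob Sample", "2 Oak Ave", "Springfield"),
        (3, "Cy Test", "3 Elm Rd", "Shelbyville"),
    ],
)
# A dummy 1 MB picture for one person.
conn.execute("INSERT INTO person_picture VALUES (?, ?)", (1, bytes(1024 * 1024)))


def people_in(city):
    """List (name, address) pairs without reading any picture data."""
    return conn.execute(
        "SELECT name, address FROM person WHERE city = ? ORDER BY person_id",
        (city,),
    ).fetchall()
```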
To typify data in this way, it's best to use a table (e.g., name) and a sub-table (e.g., name_type), and then use an FK constraint. Use an FK constraint because InnoDB does not support column constraints (CHECK constraints are parsed but not enforced before MySQL 8.0.16), and the MyISAM engine is not suited for this (it is much less robust and feature-rich, and it should really only be used for performance).
This kind of normalization is fine, but it should be done with a free-format column type, like VARCHAR(40), rather than with ENUM. Use triggers to restrict the input so that it matches the types you want to support.
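A minimal sketch of the table plus sub-table idea, using SQLite through Python's sqlite3 module. It shows only the FK part (no triggers); the three type values and the add_name helper are assumptions, and SQLite's TEXT stands in for VARCHAR(40).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked to
conn.executescript("""
-- Sub-table: the list of allowed types, stored as plain text, not ENUM.
CREATE TABLE name_type (
    name_type TEXT PRIMARY KEY
);
CREATE TABLE name (
    name_id   INTEGER PRIMARY KEY,
    name_type TEXT NOT NULL REFERENCES name_type(name_type),
    value     TEXT NOT NULL
);
""")
conn.executemany(
    "INSERT INTO name_type VALUES (?)",
    [("person",), ("place",), ("content_title",)],
)


def add_name(name_type, value):
    """Return True if stored, False if the type is not in name_type."""
    try:
        conn.execute(
            "INSERT INTO name (name_type, value) VALUES (?, ?)",
            (name_type, value),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

Supporting a new type is then an INSERT into name_type rather than an ALTER TABLE, which is the main advantage over ENUM.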

Foreign key column optionally contains NULL or ID. Is there a better design?

I'm working on a database that holds answers from a questionnaire for companies.
In the table that holds the bulk of the answers I have a column (i.e. techDir) that indicates whether there is a technical director. If the company has a director then it's populated with an ID referencing a "people" table, else it holds "null".
Another design that has come to mind is the "techDir" column holding a Boolean value, leaving the look-up in the "people" table to the software logic and adding a column in the "people" table indicating the role of the person.
Which of the two designs is better? Is there generally a better design that I have not thought of?
I would say that if there is a relatively small amount of NULL values, then using NULLs would be okay. However, if you find that most rows contain NULLs, then you might be better off deleting the techDir column and placing a column referencing the "Answers" into a new table alongside another field referencing the "People" table. In other words, create an intermediate table between the Answers table and the People table containing all technical directors as shown below.
This will get rid of all the NULL values and also allow for more flexibility. If there is only one Technical Director per answer then simply make the column referencing the answers table "Unique" to create a One-to-One relationship. If you need more than one technical director, create a One-to-Many relationship as shown. Another advantage to this design is that it simplifies the query if you ever want to extract all the technical directors. I generally use a simple rule of thumb when deciding whether to use NULL values or not. If I see the table contains lots of NULLS, I remove those columns and create a new table where I can store that data. You should of course also consider the types of queries you will be executing. For example, the design above might require an Inner or Outer Join to view all the rows including the technical directors. As a developer, you should carefully weigh up the pros and cons and look at things like flexibility, speed, complexity and your business rules when making these decisions.
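Since the original diagram is not available, here is a rough sketch of what such an intermediate table could look like, using SQLite through Python's sqlite3 module. The table name technical_director, the column names and the sample rows are assumptions. The UNIQUE constraint on answer_id is what makes the relationship one-to-one; dropping it gives the one-to-many variant.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked to
conn.executescript("""
CREATE TABLE people  (person_id INTEGER PRIMARY KEY, full_name TEXT NOT NULL);
CREATE TABLE answers (answer_id INTEGER PRIMARY KEY, company   TEXT NOT NULL);

-- Only answers that have a technical director get a row here: no NULLs.
CREATE TABLE technical_director (
    answer_id INTEGER NOT NULL UNIQUE REFERENCES answers(answer_id),
    person_id INTEGER NOT NULL REFERENCES people(person_id)
);

INSERT INTO people  VALUES (1, 'Pat Example');
INSERT INTO answers VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO technical_director VALUES (1, 1);
""")

# All technical directors: a plain inner join.
all_directors = conn.execute(
    """SELECT a.company, p.full_name
         FROM technical_director td
         JOIN answers a ON a.answer_id = td.answer_id
         JOIN people  p ON p.person_id = td.person_id
        ORDER BY a.answer_id"""
).fetchall()

# All answers, with the director when there is one: an outer join.
answers_with_directors = conn.execute(
    """SELECT a.company, p.full_name
         FROM answers a
         LEFT JOIN technical_director td ON td.answer_id = a.answer_id
         LEFT JOIN people p              ON p.person_id = td.person_id
        ORDER BY a.answer_id"""
).fetchall()
```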
Logically, if there is no director, there should be NULL.
In business logic, you would have a reference to a Director object there; if there is no director, there should also be null instead of the reference.
Using a boolean in fear of additional performance loss due to longer query time looks very much like premature optimisation.
Also there are joins that are optimized to do that efficiently in one query, so no additional lookups are necessary.
You could argue that it depends on how many people have a director, so you could save a little space when only 1 in a million entries has one, depending on the datatype you use. But in the end, the clearest (and best) option is indeed to make a foreign key that allows NULL, as you proposed in the first option.
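For comparison, a sketch of the nullable foreign key option, using SQLite through Python's sqlite3 module. The answers and people layouts and the sample rows are assumptions; only the techDir column name comes from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked to
conn.executescript("""
CREATE TABLE people (person_id INTEGER PRIMARY KEY, full_name TEXT NOT NULL);
CREATE TABLE answers (
    answer_id INTEGER PRIMARY KEY,
    company   TEXT NOT NULL,
    techDir   INTEGER REFERENCES people(person_id)  -- NULL: no technical director
);
INSERT INTO people  VALUES (1, 'Pat Example');
INSERT INTO answers VALUES (1, 'Acme', 1), (2, 'Globex', NULL);
""")

# One query, one join: the director's name, or None when there is none.
rows = conn.execute(
    """SELECT a.company, p.full_name
         FROM answers a
         LEFT JOIN people p ON p.person_id = a.techDir
        ORDER BY a.answer_id"""
).fetchall()


def has_tech_dir(answer_id):
    """The boolean from the second design falls out of the same column."""
    row = conn.execute(
        "SELECT techDir IS NOT NULL FROM answers WHERE answer_id = ?",
        (answer_id,),
    ).fetchone()
    return bool(row[0])
```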
I think the null for that column is ok. As far as I remember from my DB class at uni (long time ago), null is an excellent choice to represent "I don't know" or "it doesn't have".
I think the second design has the following flaw: you didn't mention how to look up the techDir for a specific question; you said that you just tag the person. Another problem might be that if you add another role in the future, the schema won't support it.
NULL is the most common way of indicating no relationship in an optional relationship.
There is an alternative. Decompose the table into two tables, one of which has two foreign keys, one back to the original table and one forward to the related table. In cases where there is no relationship, just omit the entire row.
If you want to understand this in terms of normalization, look up "Sixth Normal form" (6NF). Not all experts are in agreement about 6NF.