MySQL: database structure choice - big data - duplicate data or bridging

We have a 90GB MySQL database with some very big tables (more than 100M rows). We know this is not the best DB engine but this is not something we can change at this point.
Planning a serious refactoring (for performance and standardization), we are considering several approaches to restructuring our tables.
The data flow / storage is currently done in this way:
We have one table called articles, one connection table called article_authors, and one table called authors
One single author can have 1..n firstnames, 1..n lastnames, 1..n emails
Every author has a unique parent (unique_author), except if that author is the parent
The possible data query scenarios are as follows:
Get the author firstname, lastname and email for a given article
Get the unique authors.id for an author called John Smith
Get all articles from the author called John Smith
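For reference, those three scenarios correspond to queries roughly like the ones below. This is only a sketch: the join and column names (articles.id, article_authors.article_id, article_authors.author_id, and the name/email columns on authors) are assumptions here, since the actual columns are in the schema shown below.

    -- 1. Author firstname, lastname and email for a given article
    SELECT au.firstname, au.lastname, au.email
    FROM article_authors aa
    JOIN authors au ON au.id = aa.author_id
    WHERE aa.article_id = 12345;

    -- 2. Unique authors.id for an author called John Smith
    SELECT DISTINCT au.unique_author
    FROM authors au
    WHERE au.firstname = 'John' AND au.lastname = 'Smith';

    -- 3. All articles from the author called John Smith
    SELECT ar.*
    FROM articles ar
    JOIN article_authors aa ON aa.article_id = ar.id
    JOIN authors au ON au.id = aa.author_id
    WHERE au.firstname = 'John' AND au.lastname = 'Smith';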
The current DB schema looks like this:
EDIT: The main problem with this structure is that we always duplicate similar given_names and last_names.
We are now hesitating between two different structures:
1. A large number of tables: the data is split and connected with IDs, with no duplicates in the main tables (articles and authors). We are not sure how this will impact performance, since we would need several joins in order to retrieve data. Example:
2. The data is split among a reasonable number of tables, with duplicate entries in the table article_authors (author firstname, lastname and email alternatives) in order to reduce the number of tables and the application code complexity. One author could have 10 alternatives, so we would have 10 entries for the same author in the article_authors table:

The current schema is probably the best. The middle table is a many-to-many mapping table, correct? That can be made more efficient by following the tips here: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table
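Condensed into a sketch, that advice looks roughly like this (the column names are taken from the question; everything else is an assumption following the general shape described at that link):

    -- Many-to-many mapping table: no surrogate id,
    -- composite PRIMARY KEY one way, secondary index the other way.
    CREATE TABLE article_authors (
        article_id INT UNSIGNED NOT NULL,
        author_id  MEDIUMINT UNSIGNED NOT NULL,
        PRIMARY KEY (article_id, author_id),   -- lookups by article
        INDEX       (author_id, article_id)    -- lookups by author
    ) ENGINE=InnoDB;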
Rewrite #1 smells like "over-normalization". A big waste.
Rewrite #2 has some merit. Let's talk about phone_number instead of last_name because it is rather common for a person to have multiple phone_numbers (home, work, mobile, fax), but unlikely to have multiple names. (Well, OK, there are pseudonyms for some authors).
It is not practical to put a bunch of phone numbers in a cell; it is much better to have a separate table of phone numbers linked back to whoever they belong to. This would be 1:many. (Ignore the case of two people sharing the same phone number -- due to sharing a house, or due to working at the same company. Let the number show up twice.)
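A minimal sketch of that 1:many phone table (the table and column names here are illustrative, not from the question):

    CREATE TABLE phone (
        person_id    MEDIUMINT UNSIGNED NOT NULL,  -- who the number belongs to
        phone_type   VARCHAR(10) NOT NULL,         -- 'home', 'work', 'mobile', 'fax'
        phone_number VARCHAR(20) NOT NULL,
        PRIMARY KEY (person_id, phone_type, phone_number)
    ) ENGINE=InnoDB;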
I don't see why you want to split firstname and lastname. What is the "firstname" of "J. K. Rowling"? I suggest that it is not useful to split names into first and last.
A single author would have a unique "id". MEDIUMINT UNSIGNED AUTO_INCREMENT is good for such. "J. K. Rowling" and "JK Rowling" can both link to the same id.
More
I think it is very important to have a unique id for each author. The id can be then used for linking to books, etc.
You have pointed out that it is challenging to map different spellings into a single id. I think this should be essentially a separate task with separate table(s). And it is this task that you are asking about.
That is, split the database, and split the tasks in your mind, into:
one set of tables containing stuff to help deduce the correct author_id from the inconsistent information provided from the outside.
one set of tables where author_id is known to be unique.
(It does not matter whether this is one versus two DATABASEs, in the MySQL sense.)
The mental split helps you focus on the two different tasks, plus it prevents some schema constraints and confusion. None of your proposed schemas does the clean split I am proposing.
Your main question seems to be about the first set of tables -- how to turn strings of text ("JK Rawling") into a specific id. At this point, the question is first about algorithms, and only secondly about the schema.
That is, the tables should be designed to support the algorithm, not to drive it. Furthermore, when a new provider comes along with some strange new text format, you may need to modify the schema - possibly adding a special table for that provider's data. So, don't worry about making the perfect schema this early in the game; plan on running ALTER TABLE and CREATE TABLE next month or even next year.
If a provider is consistent in spelling, then a table with (provider_id, full_author_name, author_id) is probably a good first cut. But that does not handle variations of spelling, new authors, and new providers. We are getting into gray areas where human intervention will quickly be needed. Even worse is the issue of two authors with the same name.
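A first-cut sketch of such a lookup table (the names below are assumptions; the variant-spelling, new-author, and same-name cases still need the human/algorithmic handling described here):

    CREATE TABLE author_name_lookup (
        provider_id      SMALLINT UNSIGNED NOT NULL,
        full_author_name VARCHAR(100) NOT NULL,       -- the string as that provider spells it
        author_id        MEDIUMINT UNSIGNED NOT NULL, -- the canonical author id
        PRIMARY KEY (provider_id, full_author_name)
    ) ENGINE=InnoDB;

    -- Resolve an incoming string for a known provider:
    SELECT author_id
    FROM author_name_lookup
    WHERE provider_id = 7 AND full_author_name = 'JK Rawling';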
So, design the algorithm with the assumption that simple data is easily and efficiently available from a database. From that, the schema design will somewhat easily flow.
Another tip here... Some degree of "brute force" is OK for the hard-to-match cases. Most of the time, you can easily map name strings to author_id very efficiently.
It may be easier to fetch a hundred rows from a table, then massage them in your algorithm in your app code. (SQL is rather clumsy for algorithms.)

If you want to reduce size, you could also think about splitting email addresses into two parts: 'jkrowling@' + 'gmail.com'. You could have a table where you store common email domains, but seeing that over-normalization is a concern...
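A rough sketch of that idea, if the extra join were ever worth it (all table and column names here are invented for illustration):

    CREATE TABLE email_domain (
        id     SMALLINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        domain VARCHAR(60) NOT NULL,
        UNIQUE (domain)
    ) ENGINE=InnoDB;

    CREATE TABLE author_email (
        author_id  MEDIUMINT UNSIGNED NOT NULL,
        local_part VARCHAR(64) NOT NULL,        -- 'jkrowling'
        domain_id  SMALLINT UNSIGNED NOT NULL,  -- points to email_domain.id ('gmail.com')
        PRIMARY KEY (author_id, local_part, domain_id)
    ) ENGINE=InnoDB;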

Related

Database: 70+ Columns or Multiple Tables?

Building a database system for my local Medical Association.
What we have is a list with something like 70+ fields of information for each member of the association. Stuff like name, surname, home address, office address, phone numbers, specialty +++ many small details.
At the moment I've built one table with all the information related to the doctors, plus other tables for related things like subscription payments, requests, penalties, etc.
I'm quite new to database design, and while it works, I find my design ugly. It is logical, as all the information in each row is unique and belongs to just one person, but I'm sure there is a better way to do it.
How would you go about it? Should I do multiple 1:1 tables, one for each subject (basic info, contact info, education, etc.), or just keep it as it is: one table with 70+ columns?
I wouldn't worry about 70 columns in a table. This is not a problem for MySQL.
MySQL can support many more columns. InnoDB's hard limit on the number of columns in a table is 1000.
Read this blog about Understanding the Maximum Number of Columns in a MySQL Table for details.
It's more convenient to put all the attributes that belong with a table into that table. It will take more coding work to support separating columns into multiple tables if you feel you need to do that.
If some of the columns are not applicable, use NULL. NULL takes almost no storage in MySQL, so you won't be "wasting" any space by having a lot of columns most of which are NULL.
The only downside is that you may find yourself adding more columns as time goes on, and that could be inconvenient if the table grows large and access to the table is blocked while you are altering it. In that case, learn to use pt-online-schema-change, a free tool that allows you to alter tables while continuing to use them.
1:1 is rarely wise. But it may be advisable if you have "too many" columns, especially if they are "too bulky".
Do use suitable datatypes for the columns -- ints, dates, etc.
Do use suitable VARCHAR sizes, not blindly VARCHAR(255) or TEXT. (This will help later in certain subtle limits and in performance.)
Study the data. If, for example, only half the persons have a "subscription", then the 5 columns relating to a subscription can (should) be moved to a separate table. Probably the subscription table would have a person_id for linking to the main table. This is potentially 1:many (one person to many subscriptions) if that is relevant. (See the sketch after these tips.)
By splitting off some columns, you avoid lots of NULLs. Nulls are not harmful; it just seems sloppy if there are lots of nulls.
If you are talking about only a few hundred rows, then you are unlikely to encounter significant performance issues regardless of how you structure the tables.
"phone numbers" used to come in "home" and "work". Now there is "fax", "cell", "emergency contact", and even multiple numbers of any category. That is very likely a 1-person-to-many-numbers.
Selectively "normalize" the data. You said "local". It may be worth "normalizing" (city, state, zip) into a separate table. Or it may not be worth the effort. I argue that you should not normalize phone numbers.
Do not have an "array" of things splayed across columns. Definitely use a separate table when such occurs.
Do think about 1:1, 1:many, and many:many when building "entity" tables and their "relationships".
If you have a million rows, these tips become more important and need to be pondered carefully.
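As a hedged illustration of the subscription split mentioned above (the member/subscription table and column names here are invented for illustration, not taken from the question):

    -- Main table keeps the columns that apply to every member.
    CREATE TABLE member (
        person_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        name      VARCHAR(100) NOT NULL,
        specialty VARCHAR(60)
        -- ... the other always-applicable columns ...
    ) ENGINE=InnoDB;

    -- Only members who actually have a subscription get rows here (1:many).
    CREATE TABLE subscription (
        subscription_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        person_id       INT UNSIGNED NOT NULL,
        start_date      DATE NOT NULL,
        end_date        DATE,
        amount_paid     DECIMAL(8,2),
        FOREIGN KEY (person_id) REFERENCES member(person_id)
    ) ENGINE=InnoDB;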

Spreading/distributing an entity into multiple tables instead of a single one

Why would anyone distribute an entity (for example user) into multiple tables by doing something like:
user(user_id, username)
user_tel(user_id, tel_no)
user_addr(user_id, addr)
user_details(user_id, details)
Is there any speed-up bonus you get from this DB design? It's highly counter-intuitive, because it would seem that performing chained joins to retrieve the data sounds immeasurably worse than using a select projection.
Of course, if one performs other queries by making use only of the user_id and username, that's a speed-up, but is it worth it? So, where is the real advantage and what could be a compatible working scenario that's fit for such a DB design strategy?
LATER EDIT: in the details of this post, please assume a complete, unique entity, whose attributes do not vary in quantity (e.g. a car has only one color, not two; a user has only one username/social sec number/matriculation number/home address/email/etc.). That is, we're not dealing with a one-to-many relation, but with a 1-to-1, completely consistent description of an entity. In the example above, this is just the case where a single table has been "split" into as many tables as non-primary-key columns it had.
By splitting the user in this way you have exactly 1 row in user per user, which links to 0-n rows each in user_tel, user_details, user_addr
This in turn means that these can be considered optional, and/or each user may have more than one telephone number linked to them. All in all it's a more adaptable solution than hardcoding it so that users always have up to 1 address, up to 1 telephone number.
The alternative method is to have e.g. user.telephone1, user.telephone2, etc.; however, this methodology goes against 3NF ( http://en.wikipedia.org/wiki/Third_normal_form ) - essentially you are introducing a lot of columns to store the same piece of information.
edit
Based on the additional edit from OP, assuming that each user will have precisely 0 or 1 of each tel, address, details, and NEVER any more, then storing those pieces of information in separate tables is overkill. It would be more sensible to store within a single user table with columns user_id, username, tel_no, addr, details.
If memory serves this is perfectly fine within 3NF though. You stated this is not about normal form, however if each piece of data is considered directly related to that specific user then it is fine to have it within the table.
If you later expanded the table to have telephone1, telephone2 (for example) then that would violate 1NF. If you have duplicate fields (i.e. multiple users share an address, which is entirely plausible), then that violates 2NF which in turn violates 3NF
This point about violating 2NF may well be why someone has done this.
The author of this design perhaps thought that storing NULLs could be achieved more efficiently in the "sparse" structure like this, than it would "in-line" in the single table. The idea was probably to store rows such as (1 , "john", NULL, NULL, NULL) just as (1 , "john") in the user table and no rows at all in other tables. For this to work, NULLs must greatly outnumber non-NULLs (and must be "mixed" in just the right way), otherwise this design quickly becomes more expensive.
Also, this could be somewhat beneficial if you'll constantly SELECT single columns. By splitting columns into separate tables, you are making them "narrower" from the storage perspective and lower the I/O in this specific case (but not in general).
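To make the trade-off concrete, here is what the two access patterns look like under the split design (a sketch using the table names from the question):

    -- Reassembling the whole "user" entity needs outer joins,
    -- because the side tables may have no row for a given user:
    SELECT u.user_id, u.username, t.tel_no, a.addr, d.details
    FROM user u
    LEFT JOIN user_tel     t USING (user_id)
    LEFT JOIN user_addr    a USING (user_id)
    LEFT JOIN user_details d USING (user_id)
    WHERE u.user_id = 42;

    -- The narrow case the split actually helps: touching one attribute only.
    SELECT tel_no FROM user_tel WHERE user_id = 42;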
The problems of this design, in my opinion, far outweigh these benefits.

For storing people in MySQL (or any DB) - multiple tables or just one?

Our company has many different entities, but a good chunk of those database entities are people. So we have customers, and employees, and potential clients, and contractors, and providers and all of them have certain attributes in common, namely names and contact phone numbers.
I may have gone overboard with object-oriented thinking, but now I am looking at making one "Person" table that contains all of the people, with flags/subtables "extending" that model and adding role-based attributes to junction tables as necessary. If we grow to, say, 250.000 people (on MySQL and MyISAM), will this so greatly impact performance that future DBAs will curse me forever? Our single most common search is on name/surname combinations.
For, e.g. a company like Salesforce, are Clients/Leads/Employees all in a centralised table with sub-views (for want of a better term) or are they separated into different tables?
Caveat: this question is to do with "we found it better to do this in the real world" as opposed to theoretical design. I like the above solution, and am confident that with views, proper sizing and accurate indexing, that performance won't suffer. I also feel that the above doesn't count as a MUCK, just a pretty big table.
One 'person' table is the most flexible, efficient, and trouble-free approach.
It will be easy for you to do limited searches - find all people with this last name and who are customers, for example. But you may also find you have to look up someone when you don't know what they are - that will be easiest when you have one 'person' table.
However, you must consider the possibility that one person is multiple things to you - a customer because they bought something and a contractor because you hired them for a job. It would be better, therefore, to have a 'join' table that gives you a many to many relationship.
CREATE TABLE person_type (
    person_id int unsigned,
    person_type_id int unsigned,
    date_started datetime,
    date_ended datetime,
    [ ... ]
)
(You'll want to add indexes and foreign keys, of course. person_id is a FK to 'person' table; 'person_type_id' is a FK to your reference table for all possible person types. I've added two date fields so you can establish when someone was what to you.)
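Once that join table exists, role questions become straightforward queries. For example (a sketch only: the person table's name column and the person_type_id values are assumptions, since they depend on your reference table):

    -- People who are (or were) both customers and contractors:
    SELECT p.person_id, p.name
    FROM person p
    JOIN person_type cust ON cust.person_id = p.person_id AND cust.person_type_id = 1  -- 'customer' (assumed id)
    JOIN person_type cont ON cont.person_id = p.person_id AND cont.person_type_id = 2; -- 'contractor' (assumed id)

    -- What a given person was to you on a given date:
    SELECT person_type_id
    FROM person_type
    WHERE person_id = 42
      AND date_started <= '2015-06-01'
      AND (date_ended IS NULL OR date_ended >= '2015-06-01');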
Since you have many different "types" of Persons, in order to have normalized design, with proper Foreign Key constraints, it's better to use the supertype/subtype pattern. One Person table (with the common to all attributes) and many subtype tables (Employee, Contractor, Customer, etc.), all in 1:1 relationship with the main Person table, and with necessary details for every type of Person.
Check this answer by @Branko for an example: Many-to-Many but sourced from multiple tables
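A minimal sketch of that supertype/subtype pattern (the attribute columns below are invented for illustration; the point is the shared key and the 1:1 foreign keys):

    CREATE TABLE person (
        person_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        surname   VARCHAR(60) NOT NULL,
        name      VARCHAR(60) NOT NULL,
        phone     VARCHAR(20)
    ) ENGINE=InnoDB;

    -- 1:1 subtype: the primary key is also the foreign key to the supertype.
    CREATE TABLE employee (
        person_id  INT UNSIGNED PRIMARY KEY,
        hired_date DATE NOT NULL,
        salary     DECIMAL(10,2),
        FOREIGN KEY (person_id) REFERENCES person(person_id)
    ) ENGINE=InnoDB;

    CREATE TABLE customer (
        person_id       INT UNSIGNED PRIMARY KEY,
        account_manager INT UNSIGNED,
        FOREIGN KEY (person_id) REFERENCES person(person_id)
    ) ENGINE=InnoDB;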
250.000 records for a database is not very much. If you set your indexes appropriately you will never find any problems with that.
You should probably set a type for a user. Those types should be in a different table, so you can see what the type means (make it a TINYINT or similar). If you need additional fields per user type, you could indeed create a different table for that.
This approach sounds really good to me
Theoretically it would be possible to be a customer for the company you work for.
But if that's not the case here, then you could store people in different tables depending on their role.
However like Topener said, 250.000 isn't much. So I would personally feel safe to store every single person in one table.
And then have a column for each role (employee, customer, etc.)
Even if you end up with a one table solution (for core person attributes), you are going to want to abstract it with views and put on some constraints.
The last thing you want to do is send confidential information to clients which was only supposed to go to employees because someone didn't join correctly. Or an accidental cross join which results in income being doubled on a report (but only for particular clients which also had an employee linked somehow).
It really depends on how you want the layers to look and which components are going to access which layers and how.
Also, I would think you want to revisit your choice of MyISAM over InnoDB.

Does it cause problems to have a table associated with multiple content types?

I have multiple content types, but they all share some similarities. I'm wondering when it is a problem to use the same table for a different content type? Is it ever a problem? If so, why?
Here's an example: I have five kinds of content, and they all have a title. So, can't I just use a 'title' table for all five content types?
Extending that example: a title is technically a name. People and places have names. Would it be bad to put all of my content titles, people names, and place names in a "name" table? Why separate into place_name, person_name, content_title?
I have different kinds of content. In the database, they seem very similar, but the application uses the content in different ways, producing different outputs. Do I need a new table for each content type because it has a different result with different kinds of dependencies, or should I just allow null values?
I wouldn't do that.
If there are multiple columns that are the same among multiple tables, you should indeed normalize these to 1 table.
An example of that would be several types of users, which all require different columns, but all share some characteristics (e.g. name, address, phone number, email address).
These could be normalized to 1 table, which is then referenced to by all other tables through a foreign key. (see http://en.wikipedia.org/wiki/Database_normalization )
Your example only shows 1 common column, which is not worth normalizing. It would even reduce performance when fetching your data, because you'd need to join 2 tables to get all the data, one of which (the one with the titles) contains a lot of data you don't need, thus straining the server more.
While normalization is a very good practice to avoid redundancy and ensure consistency, it can sometimes be bad for performance. For example, for a person table with columns like name, address, dob, it's not very good performance-wise to have a picture in the same table. A picture can easily be about 1MB, while the remaining columns may not take more than 1K. Imagine how many blocks of data need to be read even if you only want to list the name and address of people living in a certain city - if you are keeping everything in the same table.
If there is a variation in the size of the contents and you might have to retrieve only certain types of contents in the same query, the performance gain from storing them in separate tables will easily outweigh the normalization concerns.
To typify data in this way, it's best to use a table (e.g., name) and a sub-table (e.g., name_type), and then use an FK constraint. Use an FK constraint because InnoDB does not support column constraints, and the MyISAM engine is not suited for this (it is much less robust and feature-rich, and it should really only be used for performance).
This kind of normalization is fine, but it should be done with a free-format column type, like VARCHAR(40), rather than with ENUM. Use triggers to restrict the input so that it matches the types you want to support.
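A hedged sketch of that arrangement (table and column names assumed; here the FK itself restricts the type column to the values in the reference table, with a trigger as an optional extra layer of validation):

    -- Reference table of allowed types.
    CREATE TABLE name_type (
        type VARCHAR(40) PRIMARY KEY      -- 'person', 'place', 'content_title', ...
    ) ENGINE=InnoDB;

    CREATE TABLE name (
        name_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        name    VARCHAR(255) NOT NULL,
        type    VARCHAR(40)  NOT NULL,
        FOREIGN KEY (type) REFERENCES name_type(type)  -- restricts input without ENUM
    ) ENGINE=InnoDB;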

MySQL indexes - how to boost performance?

I'm trying to improve the performance of an existing MySQL database. It's a database regarding restaurants, and there are two relevant tables:
There's a table for all entities of the website; every entity has a unique id. An entity can be pretty much anything: a restaurant, a user, and many other things. There are several entity types, and for restaurants the entity type is "object".
Let me also say that this structure of the database already exists, so I don't want to make big changes; I'm not going to remove the table of all the entities, for example. (The database itself has no data yet, but the PHP engine is built around it, so it'll be hard to make big changes to the structure.)
There's also a table only for objects. There are several types of objects in that database, but restaurants specifically are going to be searched for a lot, since that's the subject of the website. Restaurants have several fields: country, city, name, genre.
There can't be two restaurants with the same name in the same city and country. (There CAN be, for example, two restaurants with the same name but in different cities of the same country, or in two cities that have the same name but are in different countries.) So from this fact I guess I should make a unique three-column index for the country, city and name columns.
Also I want to say that the URL is built in the form www.domain.com/Country/City/Restaurant-Name, so the combination of country-city-name should be fetched fast, and this type of query will happen a lot.
But there'll also be queries of a lot of other types, like: searching for the name of a restaurant (using a LIKE query, because the name searched for can be a part of the full name) in a certain city, or in a certain country; searching for all the restaurants of a certain genre in a certain country and city; and pretty much all the combinations possible.
Probably the most used queries will be (a) searching for a restaurant name in a certain city and country (which will be the same as the query used when a URL is typed, but will use LIKE), (b) searching for restaurants of a certain type in a certain city and country, and lastly (c) searching for a restaurant name globally (in the whole database, without specifying the city and the country).
This table (the objects table) currently has a PRIMARY KEY that is the id of the objects, and the id is also used a lot. Would the best practice be the following?
make a three-column UNIQUE index out of country, city, name
make another (not-unique) index out of the names (so a query of type c, which I've written above, will be executed fast)
maybe make some kind of sub-table that contains only the restaurants out of the objects table, so this sub-table will be queried (this is less important, since if I decide to make a large change I'll probably separate the restaurants from the rest of the objects to begin with)
I'd really appreciate any help, because I've been trying to decide this for a long time.
P.S. In the objects table, some of the objects won't have any genre or any country or city, so those will stay NULL. I know that NULL values are allowed in a UNIQUE KEY, but will they have an impact on performance?
Thanks a lot to anyone who was willing to read this long question :)
You can think and plan as long as you want, but you won't know for certain what's best until you try, benchmark, and compare your options. That said, it certainly sounds like you're definitely on the right track.
composite key
Your "country-city-name" composite key appears to be in the most useful order, since it's ordered from broadest to narrowest selection criteria. I'm sure you did this intentionally, as a composite key's values can only be used from left to right. Because name does not come first in that index, you'd need a separate key for just name, as you noted.
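In index terms, that might look like the following (a sketch against the objects table described in the question; the index names and the example values are assumptions):

    -- Unique composite key, ordered so the URL lookup (country, city, name) uses it fully,
    -- and (country) or (country, city) searches can use its left prefix.
    ALTER TABLE objects
        ADD UNIQUE KEY uk_country_city_name (country, city, name),
        ADD KEY idx_name (name);   -- for global name searches (query type c)

    -- Uses the composite key end to end:
    SELECT * FROM objects
    WHERE country = 'France' AND city = 'Paris' AND name LIKE 'Chez%';

Note that idx_name only helps a LIKE whose pattern is anchored at the start ('Chez%'); a leading-wildcard search ('%pizza%') cannot use it and will still scan.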
index values of NULL
According to imysql.cn, "allowing NULL values in the index really doesn't impact performance." That's simply stated as an aside without data or reference, so I don't know how/if they'd proven that.
splitting the table
If there's a lot of other data mixed in with the restaurant records, sure, it could slow things a bit. If you shard the table into identically-structured "restaurant" and "other" tables, you could still easily query their combined data if necessary with a simple UNION. Unless you have an idea of the data/slowdown to expect, I'd prefer to avoid sharding the table unless necessary, at least for the sake of simplicity/uniformity.
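If you did split, the combined query is still simple (a sketch, assuming the two tables keep identical columns):

    SELECT country, city, name, genre FROM restaurant   WHERE country = 'Italy'
    UNION ALL
    SELECT country, city, name, genre FROM other_object WHERE country = 'Italy';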
Are there any foreseeable queries that current indexing wouldn't account for, such as a city without the country? If so, be sure to index appropriately to cover all foreseeable cases. You didn't mention it, but I assume you'll also have an index on genre.
Ultimately, you need to generate lots of test data and try it out. (Determine how much data you could eventually expect, and generate at least triple that much test data to put the system through its paces.) From what you've described, the design sounds pretty good, but testing may reveal unexpected issues, places where you'd benefit from different indexing, etc. With any issue found, you'd have a specific goal to accomplish rather than simply pondering all what-if scenarios.