MySQL indexes - how to boost performance?

I'm trying to improve the performance of an existing MySQL database.
It's a database for restaurants, and there are two relevant tables:
There's a table for all entities of the website. Every entity has a unique id,
and an entity can be pretty much anything: a restaurant, a user, and many other things.
There are several entity types, and restaurants have the entity type "object".
Let me also say that this database structure already exists,
so I don't want to make big changes; I'm not going to remove the table of all the entities,
for example. (The database itself has no data yet, but the PHP engine is built
around this structure, so it would be hard to change it significantly.)
There's also a table only for objects. There are several types of
objects in that database, but restaurants specifically are going to be
searched for a lot, since that's the subject of the website.
Restaurants have several fields: country, city, name, genre.
There can't be two restaurants with the same name in the same city and country
(there CAN be, for example, two restaurants with the same name in different cities
of the same country, or in two cities that have the same name but are in different countries),
so from this fact I guess I should make a unique three-column index on the country, city and name columns.
Also, the URL is built in the form www.domain.com/Country/City/Restaurant-Name, so the combination of country-city-name should be fetched fast, and this type of query will happen a lot.
But there will also be many other kinds of queries, like: searching for the name of
a restaurant (using a LIKE query, because the name searched for can be part
of the full name) in a certain city, or in a certain country;
searching for all the restaurants of a certain genre in a certain country and city;
and pretty much every combination possible.
Probably the most used queries will be (a) searching for a restaurant name in a certain city
and country (the same as the query used when a URL is typed, but using
LIKE), (b) searching for restaurants of a certain type in a certain city and country,
and lastly (c) searching for a restaurant name globally (in the whole database, without specifying the city and country).
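Roughly, I imagine those three queries looking something like this (column names as I've described them above; the literal values are just examples):

-- (a) name search with LIKE, scoped to a country and city
-- (note: a leading % wildcard prevents index use on name)
SELECT * FROM objects
WHERE country = 'France' AND city = 'Paris' AND name LIKE '%bistro%';

-- (b) genre search, scoped to a country and city
SELECT * FROM objects
WHERE country = 'France' AND city = 'Paris' AND genre = 'Italian';

-- (c) global name search
SELECT * FROM objects
WHERE name LIKE '%bistro%';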
This table (the objects table) currently has a PRIMARY KEY that is the ID of the objects,
and the ID is also used a lot. Would the best practice be the following?
make a three-column UNIQUE index out of country, city, name
make another (non-unique) index out of the names (so a query of type (c), which I've written
above, will execute fast)
maybe make some kind of sub-table that contains only the restaurants out of the objects
table, so that this sub-table is what gets queried (this is less important, since if I decide
to make a large change I'll probably separate the restaurants from the rest of the objects
to begin with)
I'd really appreciate any help, because I've been trying to decide this for a long time.
P.S. In the objects table some of the objects won't have any genre, country or city,
so those columns will stay NULL. I know that NULL values are allowed in a UNIQUE KEY, but will they
have an impact on performance?
Thanks a lot to anyone who was willing to read this long question :)

You can think and plan as long as you want, but you won't know for certain what's best until you try, benchmark, and compare your options. That said, it certainly sounds like you're on the right track.
composite key
Your "country-city-name" composite key appears to be in the most useful order, since it's ordered from broadest to narrowest selection criteria. I'm sure you did this intentionally, as a composite key's values can only be used from left to right. Because name does not come first in that index, you'd need a separate key for just name, as you noted.
index values of NULL
According to imysql.cn, "allowing NULL values in the index really doesn't impact performance." That's simply stated as an aside without data or reference, so I don't know how/if they'd proven that.
splitting the table
If there's a lot of other data mixed in with the restaurant records, sure, it could slow things a bit. If you shard the table into identically-structured "restaurant" and "other" tables, you could still easily query their combined data if necessary with a simple UNION. Unless you have an idea of the data/slowdown to expect, I'd prefer to avoid sharding the table unless necessary, at least for the sake of simplicity/uniformity.
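For example, if you did split it, recombining the data stays a one-liner (assuming the two tables keep identical structures):

-- UNION ALL skips the duplicate-elimination pass of plain UNION,
-- which is safe here because each row lives in exactly one table.
SELECT * FROM restaurants WHERE country = 'France'
UNION ALL
SELECT * FROM other_objects WHERE country = 'France';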
Are there any foreseeable queries that current indexing wouldn't account for, such as a city without the country? If so, be sure to index appropriately to cover all foreseeable cases. You didn't mention it, but I assume you'll also have an index on genre.
Ultimately, you need to generate lots of test data and try it out. (Determine how much data you could eventually expect, and generate at least triple that much test data to put the system through its paces.) From what you've described, the design sounds pretty good, but testing may reveal unexpected issues, places where you'd benefit from different indexing, etc. With any issue found, you'd have a specific goal to accomplish rather than simply pondering all what-if scenarios.
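A quick, rough way to inflate test data in MySQL is to re-insert the table into itself, randomizing the unique column so the composite UNIQUE key doesn't reject the copies (a sketch; adjust columns to your schema):

-- Each pass doubles the row count; run it repeatedly.
-- (Random suffix collisions are possible but rare; re-run on error.)
INSERT INTO objects (country, city, name, genre)
SELECT country, city, CONCAT(name, '-', FLOOR(RAND() * 1000000)), genre
FROM objects;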

Related

MySQL: database structure choice - big data - duplicate data or bridging

We have a 90GB MySQL database with some very big tables (more than 100M rows). We know this is not the best DB engine but this is not something we can change at this point.
Planning a serious refactoring (for performance and standardization), we are considering several approaches to restructuring our tables.
The data flow / storage is currently done in this way:
We have one table called articles, one connection table called article_authors and one table authors
One single author can have 1..n firstnames, 1..n lastnames, 1..n emails
Every author has a unique parent (unique_author), except if that author is the parent
The possible data query scenarios are as follows:
Get the author firstname, lastname and email for a given article
Get the unique authors.id for an author called John Smith
Get all articles from the author called John Smith
(The current DB schema was shown as a diagram, not reproduced here.)
EDIT: The main problem with this structure is that we always duplicate similar given_names and last_names.
We are now hesitating between two different structures:
Rewrite #1: a large number of tables; the data is split up and connected by IDs, with no duplicates in the main tables (articles and authors). We're not sure how this would impact performance, since we would need several joins to retrieve the data.
Rewrite #2: the data is split among a reasonable number of tables, with duplicate entries in the article_authors table (author firstname, lastname and email alternatives) to reduce the number of tables and the application-code complexity. One author could have 10 alternatives, so we would have 10 entries for the same author in the article_authors table.
The current schema is probably the best. The middle table is a many-to-many mapping table, correct? That can be made more efficient by following the tips here: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table
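The gist of those tips, sketched with the column names from your schema (details assumed):

CREATE TABLE article_authors (
  article_id INT UNSIGNED NOT NULL,
  author_id MEDIUMINT UNSIGNED NOT NULL,
  PRIMARY KEY (article_id, author_id),  -- covers lookups by article
  KEY (author_id, article_id)           -- covers lookups by author
) ENGINE=InnoDB;
-- No surrogate AUTO_INCREMENT id: the composite primary key is the
-- row's identity, and the second index covers the reverse direction.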
Rewrite #1 smells like "over-normalization". A big waste.
Rewrite #2 has some merit. Let's talk about phone_number instead of last_name because it is rather common for a person to have multiple phone_numbers (home, work, mobile, fax), but unlikely to have multiple names. (Well, OK, there are pseudonyms for some authors).
It is not practical to put a bunch of phone numbers in a cell; it is much better to have a separate table of phone numbers linked back to whoever they belong to. This would be 1:many. (Ignore the case of two people sharing the same phone number -- due to sharing a house, or due to working at the same company. Let the number show up twice.)
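As a sketch (table and column names here are mine, not from your schema):

CREATE TABLE author_phones (
  author_id MEDIUMINT UNSIGNED NOT NULL,  -- links back to the author
  phone_type ENUM('home','work','mobile','fax') NOT NULL,
  phone_number VARCHAR(20) NOT NULL,
  PRIMARY KEY (author_id, phone_type, phone_number)
) ENGINE=InnoDB;
-- 1:many - an author can have any number of rows here, and a shared
-- number simply shows up once per person, as described above.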
I don't see why you want to split firstname and lastname. What is the "firstname" of "J. K. Rowling"? I suggest that it is not useful to split names into first and last.
A single author would have a unique "id". MEDIUMINT UNSIGNED AUTO_INCREMENT is good for such. "J. K. Rowling" and "JK Rowling" can both link to the same id.
More
I think it is very important to have a unique id for each author. The id can be then used for linking to books, etc.
You have pointed out that it is challenging to map different spellings into a single id. I think this should be essentially a separate task with separate table(s). And it is this task that you are asking about.
That is, split the database, and split the tasks in your mind, into:
one set of tables containing stuff to help deduce the correct author_id from the inconsistent information provided from the outside.
one set of tables where author_id is known to be unique.
(It does not matter whether this is one versus two DATABASEs, in the MySQL sense.)
The mental split helps you focus on the two different tasks, plus it prevents some schema constraints and confusion. None of your proposed schemas does the clean split I am proposing.
Your main question seems to be about the first set of tables -- how to turn strings of text ("JK Rawling") into a specific id. At this point, the question is first about algorithms, and only secondarily about the schema.
That is, the tables should be designed to support the algorithm, not to drive it. Furthermore, when a new provider comes along with some strange new text format, you may need to modify the schema - possibly adding a special table for that provider's data. So, don't worry about making the perfect schema this early in the game; plan on running ALTER TABLE and CREATE TABLE next month or even next year.
If a provider is consistent in spelling, then a table with (provider_id, full_author_name, author_id) is probably a good first cut. But that does not handle variations of spelling, new authors, and new providers. We are getting into gray areas where human intervention will quickly be needed. Even worse is the issue of two authors with the same name.
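That first cut, as a sketch (names are illustrative):

CREATE TABLE author_aliases (
  provider_id SMALLINT UNSIGNED NOT NULL,
  full_author_name VARCHAR(100) NOT NULL,
  author_id MEDIUMINT UNSIGNED NOT NULL,  -- the unique author id
  PRIMARY KEY (provider_id, full_author_name)
) ENGINE=InnoDB;

-- Resolving a consistent spelling is then a single point lookup:
SELECT author_id FROM author_aliases
WHERE provider_id = 3 AND full_author_name = 'J. K. Rowling';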
So, design the algorithm with the assumption that simple data is easily and efficiently available from a database. From that, the schema design will somewhat easily flow.
Another tip here... Some degree of "brute force" is OK for the hard-to-match cases. Most of the time, you can easily map name strings to author_id very efficiently.
It may be easier to fetch a hundred rows from a table, then massage them in your algorithm in your app code. (SQL is rather clumsy for algorithms.)
If you want to reduce size, you could also think about splitting email addresses into two parts: 'jkrowling@' + 'gmail.com'. You could have a table where you store common email domains, but seeing that over-normalization is a concern...

Primary key: a string or number (id)?

I am aware of the benefits of using integers (amount of space, performance, indexes) as primary keys, as opposed to strings.
Considering the situation below...
I have a lookup table called ap_habitat (habitat values are also unique):

id  habitat
1   Forest 1
2   Forest 2

Referenced table (fauna):

especie  habitat
X        1
Y        1
The referenced table is not very human-readable. (I know end users shouldn't care about that, but for me it would be useful to see the habitat NAME directly in the fauna table.)
To get a list of fauna and its habitat name I have to do a join...
select fauna.habitat, fauna.especie, AP_h.habitat from fauna INNER JOIN ap_habitat AS AP_h ON AP_h.id = fauna.habitat
I could create a view, but if I have to create a view for each table referencing a foreign key...
I just want to check what more experienced people would recommend.
Databases and, in general, computers are not designed to make your life more simple. They are designed to handle more data than a human mind can ever hope to remember in less time than it takes a human to blink. ;-)
Readability (especially of ideas conceived in the before-Apple age) is not an issue at all.
On top of that: If you enjoy strange problems, data mapping impedance and spending endless nights writing workarounds for problems that using real-world names as primary keys get you for free, then be our guest. But please, don't ask for our help. We already know all the problems that you'll run into and it will be very hard for us to restrain our spite.
So: Never, ever use anything but an ID (UUID or long sequence) for a primary key. There are no (good) reasons to do it and if you found one, then you simply don't see the whole picture.
Yes, it makes a couple of things harder (like understanding what your data actually means). But as I said above, computers are meant to solve "lots of data" and "too slow" and nothing else.
Create a view or write a small helper application that can run your most important queries at the click of a button.
That said, I had some success with an application which runs a query and then displays a list of check boxes where I can pull in the foreign key relations to the data that the query returns (i.e. one checkbox per FK).
You ask about number or string as primary key. But based on your example if you use a string it wouldn't be a primary key at all, because you would no longer have a lookup table for it to be the primary key of. Perhaps you would still have the table for reasons not shown, like populating a drop down or storing extended descriptions beyond just the name.
Doing needless joins is not a good thing for performance. And having needless tables might be bad for storage size as well, depending on the length of the strings and the ratio of the sizes of the two tables.
You could also consider enumerated types, in which the data is stored as numbers (more or less) but the database translates them to and from strings automatically.
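For example, using the habitat values from the question (a sketch; note that changing the value list later requires an ALTER TABLE):

CREATE TABLE fauna (
  especie VARCHAR(50) NOT NULL,
  habitat ENUM('Forest 1', 'Forest 2') NOT NULL  -- stored internally as 1 and 2
);

INSERT INTO fauna (especie, habitat) VALUES ('X', 'Forest 1');

-- No join needed; the string comes back directly:
SELECT especie, habitat FROM fauna WHERE habitat = 'Forest 1';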

For storing people in MySQL (or any DB) - multiple tables or just one?

Our company has many different entities, but a good chunk of those database entities are people. So we have customers, and employees, and potential clients, and contractors, and providers and all of them have certain attributes in common, namely names and contact phone numbers.
I may have gone overboard with object-oriented thinking, but now I am looking at making one "Person" table that contains all of the people, with flags/subtables "extending" that model and adding role-based attributes to junction tables as necessary. If we grow to, say, 250,000 people (on MySQL and MyISAM), will this so greatly impact performance that future DBAs will curse me forever? Our single most common search is on name/surname combinations.
For a company like Salesforce, for example, are Clients/Leads/Employees all in a centralised table with sub-views (for want of a better term), or are they separated into different tables?
Caveat: this question is to do with "we found it better to do this in the real world" as opposed to theoretical design. I like the above solution, and am confident that with views, proper sizing and accurate indexing, that performance won't suffer. I also feel that the above doesn't count as a MUCK, just a pretty big table.
One 'person' table is the most flexible, efficient, and trouble-free approach.
It will be easy for you to do limited searches - find all people with this last name and who are customers, for example. But you may also find you have to look up someone when you don't know what they are - that will be easiest when you have one 'person' table.
However, you must consider the possibility that one person is multiple things to you: a customer because they bought something, and a contractor because you hired them for a job. It would be better, therefore, to have a 'join' table that gives you a many-to-many relationship.
create table person_type (
  person_id int unsigned,
  person_type_id int unsigned,
  date_started datetime,
  date_ended datetime,
  [ ... ]
)
(You'll want to add indexes and foreign keys, of course. person_id is a FK to 'person' table; 'person_type_id' is a FK to your reference table for all possible person types. I've added two date fields so you can establish when someone was what to you.)
Since you have many different "types" of Persons, in order to have normalized design, with proper Foreign Key constraints, it's better to use the supertype/subtype pattern. One Person table (with the common to all attributes) and many subtype tables (Employee, Contractor, Customer, etc.), all in 1:1 relationship with the main Person table, and with necessary details for every type of Person.
Check this answer by @Branko for an example: Many-to-Many but sourced from multiple tables
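A minimal sketch of the supertype/subtype pattern (column details assumed):

CREATE TABLE person (
  person_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  first_name VARCHAR(50),
  last_name VARCHAR(50),
  phone VARCHAR(20)
) ENGINE=InnoDB;

-- Each subtype reuses the supertype's key, so the
-- relationship is 1:1 by construction.
CREATE TABLE employee (
  person_id INT UNSIGNED PRIMARY KEY,
  hire_date DATE,
  FOREIGN KEY (person_id) REFERENCES person (person_id)
) ENGINE=InnoDB;

CREATE TABLE customer (
  person_id INT UNSIGNED PRIMARY KEY,
  account_number VARCHAR(20),
  FOREIGN KEY (person_id) REFERENCES person (person_id)
) ENGINE=InnoDB;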
250,000 records is not very much for a database. If you set your indexes appropriately, you will never have any problems with that.
You should probably set a type for a user. Those types should be in a different table, so you can see what the type means (make it a TINYINT or similar). If you need additional fields per user type, you could indeed create a different table for that.
This approach sounds really good to me.
Theoretically, it would be possible to be a customer of the company you work for.
But if that's not the case here, then you could store people in different tables depending on their role.
However, as Topener said, 250,000 isn't much, so I would personally feel safe storing every single person in one table,
and then having a column for each role (employee, customer, etc.).
Even if you end up with a one table solution (for core person attributes), you are going to want to abstract it with views and put on some constraints.
The last thing you want to do is send confidential information to clients which was only supposed to go to employees because someone didn't join correctly. Or an accidental cross join which results in income being doubled on a report (but only for particular clients which also had an employee linked somehow).
It really depends on how you want the layers to look and which components are going to access which layers and how.
Also, I would think you want to revisit your choice of MyISAM over InnoDB.

Does it cause problems to have a table associated with multiple content types?

I have multiple content types, but they all share some similarities. I'm wondering when it is a problem to use the same table for a different content type? Is it ever a problem? If so, why?
Here's an example: I have five kinds of content, and they all have a title. So, can't I just use a 'title' table for all five content types?
Extending that example: a title is technically a name. People and places have names. Would it be bad to put all of my content titles, people names, and place names in a "name" table? Why separate into place_name, person_name, content_title?
I have different kinds of content. In the database, they seem very similar, but the application uses the content in different ways, producing different outputs. Do I need a new table for each content type because it has a different result with different kinds of dependencies, or should I just allow null values?
I wouldn't do that.
If there are multiple columns that are the same among multiple tables, you should indeed normalize these into one table.
An example would be several types of users which all require different columns but share some characteristics (e.g. name, address, phone number, email address).
These could be normalized into one table, which is then referenced by all the other tables through a foreign key (see http://en.wikipedia.org/wiki/Database_normalization ).
Your example only shows one common column, which is not worth normalizing. It would even reduce performance when fetching your data, because you'd need to join two tables to get all the data, one of which (the one with the titles) contains a lot of data you don't need, straining the server more.
While normalization is a very good practice for avoiding redundancy and ensuring consistency, it can sometimes be bad for performance. For example, in a person table with columns like name, address and dob, it's not very good performance-wise to keep a picture in the same table. A picture can easily be about 1MB, while the remaining columns may not take more than 1K. Imagine how many blocks of data need to be read, even if you only want to list the name and address of people living in a certain city, if you are keeping everything in the same table.
If there is variation in the size of the contents, and you might have to retrieve only certain types of content in the same query, the performance gain from storing them in separate tables easily outweighs the normalization.
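Sketched with the picture example (hypothetical tables):

-- Narrow, frequently-scanned table: listing names and addresses
-- touches far fewer data blocks.
CREATE TABLE person (
  person_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(100),
  address VARCHAR(200),
  dob DATE
) ENGINE=InnoDB;

-- Wide, rarely-needed data moved aside, 1:1 with person.
CREATE TABLE person_picture (
  person_id INT UNSIGNED PRIMARY KEY,
  picture MEDIUMBLOB,
  FOREIGN KEY (person_id) REFERENCES person (person_id)
) ENGINE=InnoDB;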
To typify data in this way, it's best to use a table (e.g., name) and a sub-table (e.g., name_type), and then use an FK constraint. Use an FK constraint because InnoDB does not support column constraints, and the MyISAM engine is not suited for this (it is much less robust and feature-rich, and should really only be used for performance).
This kind of normalization is fine, but it should be done with a free-format column type, like VARCHAR(40), rather than with ENUM. Use triggers to restrict the input so that it matches the types you want to support.

When is it a good idea to move columns off a main table into an auxiliary table?

Say I have a table like this:
create table users (
  user_id int not null auto_increment primary key,
  username varchar(50),      -- lengths here are illustrative
  joined_at datetime,
  bio text,
  favorite_color varchar(50),
  favorite_band varchar(50),
  ....
);
Say that over time, more and more columns -- like favorite_animal, favorite_city, etc. -- get added to this table.
Eventually, there are like 20 or more columns.
At this point, I'm feeling like I want to move columns to a separate
user_profiles table, so I can do select * from users without
returning a large number of usually irrelevant columns (like
favorite_color). And when I do need to query by favorite_color, I can just do
something like this:
select * from users inner join user_profiles using (user_id) where
user_profiles.favorite_color = 'red';
Is moving columns off the main table into an "auxiliary" table a good
idea?
Or is it better to keep all the columns in the users table, and always
be explicit about the columns I want to return? E.g.
select user_id, username, last_logged_in_at, etc. etc. from users;
What performance considerations are involved here?
Don't use an auxiliary table if it's going to contain a collection of miscellaneous fields with no conceptual cohesion.
Do use a separate table if you can come up with a good conceptual grouping of a number of fields e.g. an Address table.
Of course, your application has its own performance and normalisation needs, and you should only apply this advice with proper respect to your own situation.
I would say that the best option is to have properly normalized tables, and also to only ask for the columns you need.
A user profile table might not be a bad idea, if it is structured well to provide data integrity and simple enhancement/modification later. Only you can truly know your requirements.
One thing that no one else has mentioned is that it is often a good idea to have an auxiliary table if the row size of the main table would get too large. Read about the row size limits of your specific database in the documentation. There are often performance benefits to having tables that are less wide, moving the fields you don't use as often off to a separate table. If you choose to create an auxiliary table with a one-to-one relationship, make sure to set up the PK/FK relationship to maintain data integrity, and set a unique index or constraint on the FK field to maintain the one-to-one relationship, as sketched below.
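For example, a sketch of that one-to-one setup (names borrowed from the question):

CREATE TABLE user_profiles (
  user_id INT NOT NULL,
  bio TEXT,
  favorite_color VARCHAR(50),
  -- The PK on the FK column is itself the unique constraint
  -- that enforces the one-to-one relationship.
  PRIMARY KEY (user_id),
  FOREIGN KEY (user_id) REFERENCES users (user_id)
) ENGINE=InnoDB;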
And to go along with everyone else, I cannot stress too strongly how bad it is to ever use select * in production queries. You save a few seconds of development time, create a performance problem, and make the application less maintainable (yes, less: you should not willy-nilly return things you don't want to show in the application but need in the database; you will break insert statements that use selects; and you will show users things you don't want them to see).
Try not to get in the habit of using SELECT * FROM ... If your application becomes large, and you query the users table for different things in different parts of your application, then when you do add favorite_animal you are more likely to break some spot that uses SELECT *. Or at the least, that spot is now fetching unused fields that slow it down.
Select the data you need specifically. It self-documents to the next person exactly what you're trying to do with that code.
Don't de-normalize unless you have good reason to.
Adding a favorite column every other day, each time a user has a new favorite, is a maintenance headache at best. I would seriously consider creating a table to hold favorites values in your case. I'm pretty sure I wouldn't just keep adding a new column all the time.
The general guideline that applies here (called normalization) is that tables are grouped by distinct entities/objects/concepts, and each column (field) in a table should describe some aspect of that entity.
In your example, it seems that favorite_color describes (or belongs to) the user. Sometimes it is a good idea to move data to a second table: when it becomes clear that the data actually describes a second entity. For example: you start your database collecting user_id, name, email, and zip_code. Then at some point, the CEO decides he would also like to collect the street_address. At this point a new entity has formed, and you could conceptually view your data as two tables:
user: user_id, name, email
address: street_address, city, state, zip, user_id (as a foreign key)
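In SQL, that split might look like this (a sketch):

CREATE TABLE user (
  user_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(100),
  email VARCHAR(100)
) ENGINE=InnoDB;

CREATE TABLE address (
  address_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  street_address VARCHAR(200),
  city VARCHAR(100),
  state VARCHAR(50),
  zip VARCHAR(10),
  user_id INT UNSIGNED,  -- the foreign key back to user
  FOREIGN KEY (user_id) REFERENCES user (user_id)
) ENGINE=InnoDB;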
So, to sum it up: the real challenge is to decide what data describes the main entity of the table, and what, if any, other entity exists.
Here is a great example of normalization that helped me understand it better
When there is no other reason (e.g. the normal forms for databases), you should not do it. You don't save any space, as the data must still be stored; instead you waste more, as you need another index to access it.
It is always better (though may require more maintenance if schemas change) to fetch only the columns you need.
This will result in lower memory usage by both MySQL and your client application, and reduced query times as the amount of data transferred is reduced. You'll see a benefit whether this is over a network or not.
Here's a rule of thumb: if adding a column to an existing table would require making it nullable (after data has been migrated, etc.), then instead create a new table with all NOT NULL columns (with a foreign key reference to the original table, of course). For example:
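Illustrating with a hypothetical attribute (the names are mine): rather than adding a nullable referral_code column to users, give it its own all-NOT-NULL table:

-- Only users who actually have a referral code get a row,
-- so no column ever needs to be nullable.
CREATE TABLE user_referral_codes (
  user_id INT NOT NULL,
  referral_code VARCHAR(20) NOT NULL,
  PRIMARY KEY (user_id),
  FOREIGN KEY (user_id) REFERENCES users (user_id)
) ENGINE=InnoDB;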
You should not rely on using SELECT * for a variety of reasons (google it).