MySQL Best Practice for adding columns - mysql

So I started working for a company where they had 3 to 5 different tables that were often queried in either a complex join or through a double, triple query (I'm probably the 4th person to start working here, it's very messy).
Anyhow, I created a table that when querying the other 3 or 5 tables at the same time inserts that data into my table along with whatever information normally got inserted there. It has drastically sped up the page speeds for many applications and I'm wondering if I made a mistake here.
I'm hoping in the future to stop inserting into those other tables, insert all that information into the table I've started, and switch the applications over to that one table. It's just a lot faster.
Could someone tell me why it's much faster to group all the information into one massive table and if there is any downside to doing it this way?

If the joins are slow, it may be because the tables did not have FOREIGN KEY relationships and indexes properly defined. If the tables had been properly normalized before, it is probably not a good idea to denormalize them into a single table unless they were not performant with proper indexing. FOREIGN KEY constraints require indexing on both the PK table and the related FK column, so simply defining those constraints if they don't already exist may go a long way toward improving performance.
The first course of action is to make sure the table relationships are defined correctly and the tables are indexed, before you begin denormalizing them.
There is a concept called materialized views, which serve as a sort of cache for views or queries whose result sets are deterministic, by storing the results of a view's query into a temporary table. MySQL does not support materialized views directly, but you can implement them by occasionally selecting all rows from a multi-table query and storing the output into a table. When the data in that table is stale, you overwrite it with a new rowset. For simple SELECT queries which are used to display data that doesn't change often, you may be able to speed up your pageloads using this method. It is not advisable to use it for data which is constantly changing though.
A good use for materialized views might be constructing rows to populate your site's dropdown lists or to store the result of complicated reports which are only run once a week. A bad use for them would be to store customer order information, which requires timely access.
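The refresh-and-overwrite approach described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: it uses Python's built-in sqlite3 module as a stand-in so it runs without a MySQL server, and the table names (customers, orders, mv_customer_totals) and the function refresh_customer_totals are hypothetical.

```python
import sqlite3

# Hypothetical schema; SQLite stands in for MySQL so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total REAL NOT NULL
    );
    -- The emulated materialized view: a plain table holding a query's result.
    CREATE TABLE mv_customer_totals (
        customer_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        order_count INTEGER NOT NULL,
        total_spent REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
""")


def refresh_customer_totals(conn):
    """Overwrite the stored rowset with a fresh result of the multi-table query."""
    with conn:  # run the delete and the reload in a single transaction
        conn.execute("DELETE FROM mv_customer_totals")
        conn.execute("""
            INSERT INTO mv_customer_totals
            SELECT c.id, c.name, COUNT(o.id), COALESCE(SUM(o.total), 0)
            FROM customers c
            LEFT JOIN orders o ON o.customer_id = c.id
            GROUP BY c.id, c.name
        """)


refresh_customer_totals(conn)
rows = conn.execute(
    "SELECT customer_id, name, order_count, total_spent "
    "FROM mv_customer_totals ORDER BY customer_id"
).fetchall()
```

Pages then read from mv_customer_totals with a simple SELECT, and the refresh function is called on whatever schedule the data can tolerate.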

Without seeing the table structures, etc., it would be guesswork. But it sounds like the database was possibly over-normalized.
It is hard to say exactly what the issue is without seeing it. But you might want to look at adding indexes, and foreign keys to the tables.
If you are adding a table with all of the data in it, you might be denormalizing the database.

There are some cases where de-normalizing your tables has its advantages, but I would be more interested in finding out if the problem really lies with the table schema or with how the queries are being written. You need to know if the queries utilize indexes (or whether indexes need to be added to the table), whether the original query writer did things like using subselects when they could have been using joins to make a query more efficient, etc.
I would not just denormalize because it makes things faster unless there is a good reason for it.

Having a separate copy of the data in your newly defined table is a valid performance-enhancing practice, but on the other hand it might become a total mess when it comes to keeping the data in your table and the other ones the same. You essentially have two truths, without a good idea of how to invalidate this "cache" when it comes to updates/deletes.
Read more about "normalization" and read more about "EXPLAIN" in MySQL. It will tell you why the other queries are slow, and you might get away with a few proper indexes and foreign keys instead of copying the data.
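As a rough illustration of checking a plan before and after adding an index, here is a sketch under stated assumptions: it uses Python's sqlite3 module and SQLite's EXPLAIN QUERY PLAN as a stand-in for MySQL's EXPLAIN (whose output format is different), and the orders table, the index name, and the plan helper are hypothetical.

```python
import sqlite3

# SQLite stands in for MySQL here; MySQL's EXPLAIN prints different columns,
# but the question it answers (scan or index lookup?) is the same.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer_id INTEGER NOT NULL, total REAL NOT NULL)"
)


def plan(sql):
    """Return the plan steps for a query as one readable string."""
    # The last column of each plan row is the human-readable description.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


query = "SELECT total FROM orders WHERE customer_id = 42"

plan_before = plan(query)  # no usable index yet: the table is scanned
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan(query)   # the new index is now named in the plan

print(plan_before)
print(plan_after)
```

If the second plan still shows a scan on a real system, the index does not match the query's WHERE or JOIN columns and is worth revisiting before any data is copied.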

Related

Use many tables with few columns or few tables with many columns in MySQL? [duplicate]

I'm setting up a table that might have upwards of 70 columns. I'm now thinking about splitting it up as some of the data in the columns won't be needed every time the table is accessed. Then again, if I do this I'm left with having to use joins.
At what point, if any, is it considered too many columns?
It's considered too many once it's above the maximum limit supported by the database.
The fact that you don't need every column to be returned by every query is perfectly normal; that's why the SELECT statement lets you explicitly name the columns you need.
As a general rule, your table structure should reflect your domain model; if you really do have 70 (100, what have you) attributes that belong to the same entity there's no reason to separate them into multiple tables.
There are some benefits to splitting up the table into several with fewer columns, which is also called Vertical Partitioning. Here are a few:
If you have tables with many rows, modifying the indexes can take a very long time, as MySQL needs to rebuild all of the indexes in the table. Having the indexes split over several tables could make that faster.
Depending on your queries and column types, MySQL could be writing temporary tables (used in more complex select queries) to disk. This is bad, as disk I/O can be a big bottleneck. This occurs if you have large-object data (TEXT or BLOB columns) in the query.
Wider tables can lead to slower query performance.
Don't prematurely optimize, but in some cases, you can get improvements from narrower tables.
It is too many when it violates the rules of normalization. It is pretty hard to get that many columns if you are normalizing your database. Design your database to model the problem, not around any artificial rules or ideas about optimizing for a specific db platform.
Apply the following rules to the wide table and you will likely have far fewer columns in a single table.
No repeating elements or groups of elements
No partial dependencies on a concatenated key
No dependencies on non-key attributes
Here is a link to help you along.
That's not a problem as long as all attributes belong to the same entity and do not depend on each other.
To make life easier you can have one text column with a JSON array stored in it, obviously only if you don't have a problem with getting all the attributes every time. However, this would entirely defeat the purpose of storing it in an RDBMS and would greatly complicate every database transaction, so it's not a recommended approach to follow throughout the database.
Having too many columns in the same table can cause huge problems in replication as well. You should know that the changes that happen in the master will replicate to the slave. For example, if you update one field in the table, the whole row will be w

Alternatives to mysql for large reference tables

We currently use mysql for two types of tables:
The first set are the typical transaction based tables.
The second are tables that store historical data, which is usually written once and read many times. They are large, hundreds of millions of rows or larger, and have a couple of indexes.
We have a couple of issues with these tables.
Any schema changes take forever
We’re not comfortable with the whole table being a single point of failure. If anything goes wrong, rebuilding this table would take ages.
It doesn't seem scalable
Are there any features of mysql we are missing that would alleviate these issues? I saw that MariaDB now has a way to add columns that doesn’t lock the whole table, but it doesn’t solve the other issues.
We’re also open to other products that might solve the issue. Any ideas?
Why would you ever need to add columns to Historical data? Anyway, what values would you assign to the 'old' rows?
An alternative to adding a column is to create a "parallel" table (aka "vertical partitioning"). The new table would have the same PRIMARY KEY as the original (except for any AUTO_INCREMENT declaration). You would use LEFT JOIN to fetch columns from both tables, and understand that 'old' rows would give you NULLs for the 'new' columns.
Another useful thing to do for Historical data is to treat it like a Fact table in Data Warehousing. Then build and maintain "Summary table(s)" to significantly speed up common "report" type queries.
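The "parallel" table idea can be sketched like this. It is only an illustration: Python's sqlite3 module stands in for MySQL, and the names history, history_extra, and source are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical large historical table; its schema is left untouched.
    CREATE TABLE history (id INTEGER PRIMARY KEY, event TEXT NOT NULL);
    -- Parallel table: same primary key, holds only the newly added column.
    CREATE TABLE history_extra (
        id INTEGER PRIMARY KEY REFERENCES history(id),
        source TEXT
    );
    INSERT INTO history VALUES (1, 'old event'), (2, 'new event');
    -- Only rows written after the change have an entry here.
    INSERT INTO history_extra VALUES (2, 'mobile');
""")

# LEFT JOIN keeps every history row; rows without a match yield NULL (None).
rows = conn.execute("""
    SELECT h.id, h.event, x.source
    FROM history h
    LEFT JOIN history_extra x ON x.id = h.id
    ORDER BY h.id
""").fetchall()

print(rows)
```

The old row comes back with None for the new column, which is the NULL behaviour described above, and the big table never needed an ALTER.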
In newer versions of MySQL/MariaDB, ALTER TABLE ... ADD COLUMN ... ALGORITHM=INPLACE removes most of the performance pain.
Adding columns is also solved by moving toward an EAV schema, which has a lot of bad qualities. So, move only part-way toward such. That is, keep the 5-10 main columns that you use for filtering and sorting as real columns, then put the rest of the key-value junk into a JSON column. Both MySQL and MariaDB have such (though with some differences), plus MariaDB has "Dynamic Columns".
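A minimal sketch of that part-way approach follows, assuming a hypothetical items table. To stay runnable without a server it uses Python's sqlite3 module and stores the JSON as plain text that the application encodes and decodes; in MySQL or MariaDB a native JSON column would take the place of the TEXT column.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE items (
        id INTEGER PRIMARY KEY,
        category TEXT NOT NULL,  -- real column: used for filtering
        price REAL NOT NULL,     -- real column: used for sorting
        extra TEXT NOT NULL      -- JSON document holding the remaining attributes
    )
""")
# Only the real columns are indexed; the JSON blob is just carried along.
conn.execute("CREATE INDEX idx_items_category_price ON items (category, price)")


def add_item(category, price, **extra):
    """Store the main columns as-is and serialize everything else to JSON."""
    conn.execute(
        "INSERT INTO items (category, price, extra) VALUES (?, ?, ?)",
        (category, price, json.dumps(extra)),
    )


def items_in_category(category):
    """Filter and sort on real columns, then decode the JSON in the app."""
    cur = conn.execute(
        "SELECT price, extra FROM items WHERE category = ? ORDER BY price",
        (category,),
    )
    return [(price, json.loads(extra)) for price, extra in cur]


add_item("lamp", 30.0, color="red", watts=40)
add_item("lamp", 20.0, color="blue")
add_item("desk", 90.0, material="oak")
lamps = items_in_category("lamp")
```

New attributes can appear in the JSON document without any schema change, at the cost of not being able to filter or sort on them efficiently.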
Summary tables
EAV
"but it doesn’t solve the other issues" -- such as??

Is it better to have many columns, or many tables?

Imagine a hypothetical database, which is storing products. Each product may have 100 attributes, although any given product will only have values set for ~50 of these. I can see three ways to store this data:
A single table with 100 columns,
A single table with very few columns (say the 10 columns that have a value for every product), and another table with columns (product_id, attribute, value), i.e., an EAV data store.
A separate table for every column. So the core products table might have 2 columns, and there would be 98 other tables, each with the two columns (product_id, value).
Setting aside the shades of grey between these extremes, from a pure efficiency standpoint, which is best to use? I assume it depends on the types of queries being run, i.e. if most queries are for several attributes of a product, or the value of a single attribute for several products. How does this affect the efficiency?
Assume this is a MySQL database using InnoDB, and all tables have appropriate foreign keys, and an index on the product_id. Imagine that the attribute names and values are strings, and are not indexed.
In a general sense, I am asking whether accessing a really big table takes more or less time than a query with many joins.
I found a similar question here: Best to have hundreds of columns or split into multiple tables?
The difference is, that question is asking about a specific case, and doesn't really tell me about efficiency in the general case. Other similar questions are all talking about the best way to organize the data, I just want to know how the different organizational systems impact the speed of queries.
In a general sense, I am asking whether accessing a really big table takes more or less time than a query with many joins.
JOIN will be slower.
However, if you usually query only a specific subset of columns, and this subset is "vertically partitioned" into its own separate table, querying such "lean" table is typically quicker than querying the "fat" table with all the columns.
But this is very specific and fragile (easy to break-apart as the system evolves) situation and you should test very carefully before going down that path. Your default starting position should be one table.
In general, the more tables you have, the more normalised, more correct, and hence better (ie: reduced redundancy of data) your design.
If you later find you have problems with reporting on this data, then that may be the moment to consider creating denormalised values to improve any specific performance issues. Adding denormalised values later is much less painful than normalising an existing badly designed database.
In most cases, EAV is a querying and maintenance nightmare.
An outline design would be to have a table for Products, a table for Attributes, and a ProductAttributes table that contained the ProductID and the AttributeID of the relevant entries.
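The outline design above could look roughly like the following sketch. It uses Python's sqlite3 module as a stand-in for MySQL. The Value column and the attributes_of helper are assumptions added for illustration, since the answer names only the two ID columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Products (ProductID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    CREATE TABLE Attributes (
        AttributeID INTEGER PRIMARY KEY,
        Name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE ProductAttributes (
        ProductID INTEGER NOT NULL REFERENCES Products(ProductID),
        AttributeID INTEGER NOT NULL REFERENCES Attributes(AttributeID),
        Value TEXT NOT NULL,  -- assumption: not named in the answer
        PRIMARY KEY (ProductID, AttributeID)
    );
    INSERT INTO Products VALUES (1, 'Kettle'), (2, 'Toaster');
    INSERT INTO Attributes VALUES (1, 'color'), (2, 'voltage');
    INSERT INTO ProductAttributes VALUES
        (1, 1, 'black'), (1, 2, '230'), (2, 1, 'white');
""")


def attributes_of(product_id):
    """Return only the attributes that are actually set for one product."""
    cur = conn.execute("""
        SELECT a.Name, pa.Value
        FROM ProductAttributes pa
        JOIN Attributes a ON a.AttributeID = pa.AttributeID
        WHERE pa.ProductID = ?
        ORDER BY a.Name
    """, (product_id,))
    return dict(cur.fetchall())


kettle = attributes_of(1)
```

A product with 50 of 100 possible attributes takes 50 rows in ProductAttributes and no NULL-filled columns, and reading them is one join rather than one join per attribute.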
As you mentioned, it strictly depends on the queries that will be executed on this data. As you know, joins are expensive for a database. I can't imagine making 50-60 joins for a simple data read. In my humble opinion it would be madness. :) The best thing you can do is to load test data and check your specific queries in a tool such as Estimated Execution Plan in Management Studio. There should be a similar tool for MySQL (its EXPLAIN statement serves this purpose).
I would advise you to avoid creating so many tables. I think it is bound to cause problems in the future. Maybe it is possible to move rarely used data into separate tables or to use complex types? For string data you can try to use nonclustered indexes.

Dropping out all FKs from my write tables

I have some very write-intensive tables (user tracking tables) which will be written to nonstop. The problem is that on a fully normalized schema I will have 16 foreign keys. Some keys are purely for lookup references, some are important, like linking user ID, user session ID, activity ID, etc.
With this many FKs on a write-intensive table, performance is an issue. (I have a user content website which needs near real-time updates.) So I am planning to drop all FKs for these write-intensive tables, but before that I want to know how else I can link data. When people say "in the code", what exactly are we doing at the code level to keep data linked together, as I assume in the application we cannot have relationships?
Secondly, if I don't use FKs I assume data will still be consistent as long as the correct ID is written? That is, it won't write 3000 when the member ID is 2000 just because no FK is used?
Lastly, this will not affect joins, right? While I hope to avoid joins I may need some. But I assume joins can still be done as is, FKs or not?
Secondly, if I don't use FKs I assume data will still be consistent
as long as the correct ID is written?
Yes.
Lastly, this will not affect joins, right?
right.
When people say in the code, what exactly are we doing at the
code level to keep data linked together
This is the real question. Actually, the really real two questions are:
1) How confident are you that the incoming values are all valid and do not need to be checked?
2) How big are the lookup tables being referenced?
If the answers are "not very confident" and "really small" then you can enforce in code by caching the answers in the app layer and just doing lookups using these super-fast in-memory tables before inserting. However, consider this: the database will also cache those small tables, so it might still be simpler to keep the FKs.
If the answers are "not very confident" and "really huge" then you have a choice. You can drop the FK constraints, knowingly insert bad values and do some post-job cleanup, or you can keep those FKs in the database because otherwise you've got all of that bad data.
For this combination it is not practical to cache the tables in the app, and if you drop the FKs and do lookups from the app it is even slower than having FKs in the database.
If the answers are "100% confident" then the 2nd question does not matter. Drop the fk's and insert the data with speed and confidence.
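For the "not very confident" and "really small" case, enforcing the reference in code might look like the sketch below. It is an illustration under assumptions: Python's sqlite3 module stands in for MySQL, and the tables activity_types and user_tracking, the set valid_activity_ids, and the function track are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Small lookup table that the write-heavy table refers to.
    CREATE TABLE activity_types (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- Write-heavy table, deliberately declared without a FOREIGN KEY.
    CREATE TABLE user_tracking (
        user_id INTEGER NOT NULL,
        activity_id INTEGER NOT NULL
    );
    INSERT INTO activity_types VALUES (1, 'login'), (2, 'view'), (3, 'logout');
""")

# Cache the lookup IDs in the app layer once; reload when the lookup changes.
valid_activity_ids = {row[0] for row in conn.execute("SELECT id FROM activity_types")}


def track(user_id, activity_id):
    """Insert a tracking row only if the activity ID exists in the cache."""
    if activity_id not in valid_activity_ids:
        return False  # the check an FK constraint would otherwise perform
    conn.execute("INSERT INTO user_tracking VALUES (?, ?)", (user_id, activity_id))
    return True


accepted = track(2000, 2)   # known activity: written
rejected = track(2000, 99)  # unknown activity: refused before reaching the DB
stored = conn.execute("SELECT user_id, activity_id FROM user_tracking").fetchall()
```

The trade-off is that the cache can go stale: if a lookup row is deleted, nothing in the database stops later inserts that still reference it.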

What is a good way to denormalize a mysql database?

I have a large database of normalized order data that is becoming very slow to query for reporting. Many of the queries that I use in reports join five or six tables and are having to examine tens or hundreds of thousands of lines.
There are lots of queries and most have been optimized as much as possible to reduce server load and increase speed. I think it's time to start keeping a copy of the data in a denormalized format.
Any ideas on an approach? Should I start with a couple of my worst queries and go from there?
I know more about mssql than mysql, but I don't think the number of joins or number of rows you are talking about should cause you too many problems with the correct indexes in place. Have you analyzed the query plan to see if you are missing any?
http://dev.mysql.com/doc/refman/5.0/en/explain.html
That being said, once you are satisfied with your indexes and have exhausted all other avenues, de-normalization might be the right answer. If you just have one or two queries that are problems, a manual approach is probably appropriate, whereas some sort of data warehousing tool might be better for creating a platform to develop data cubes.
Here's a site I found that touches on the subject:
http://www.meansandends.com/mysql-data-warehouse/?link_body%2Fbody=%7Bincl%3AAggregation%7D
Here's a simple technique that you can use to keep denormalizing queries simple, if you're just doing a few at a time (and I'm not replacing your OLTP tables, just creating a new one for reporting purposes). Let's say you have this query in your application:
select a.name, b.address from tbla a
join tblb b on b.fk_a_id = a.id where a.id=1
You could create a denormalized table and populate with almost the same query:
create table tbl_ab (a_id, a_name, b_address);
-- (types elided)
Notice the underscores match the table aliases you use
insert into tbl_ab select a.id, a.name, b.address from tbla a
join tblb b on b.fk_a_id = a.id;
-- no where clause because you want everything
Then to fix your app to use the new denormalized table, switch the dots for underscores.
select a_name as name, b_address as address
from tbl_ab where a_id = 1;
For huge queries this can save a lot of time and makes it clear where the data came from, and you can re-use the queries you already have.
Remember, I'm only advocating this as the last resort. I bet there's a few indexes that would help you. And when you de-normalize, don't forget to account for the extra space on your disks, and figure out when you will run the query to populate the new tables. This should probably be at night, or whenever activity is low. And the data in that table, of course, will never exactly be up to date.
[Yet another edit] Don't forget that the new tables you create need to be indexed too! The good part is that you can index to your heart's content and not worry about update lock contention, since aside from your bulk insert the table will only see selects.
MySQL 5 does support views, which may be helpful in this scenario. It sounds like you've already done a lot of optimizing, but if not you can use MySQL's EXPLAIN syntax to see what indexes are actually being used and what is slowing down your queries.
As far as going about denormalizing data (whether you're using views or just duplicating data in a more efficient manner), I think starting with the slowest queries and working your way through is a good approach to take.
I know this is a bit tangential, but have you tried seeing if there are more indexes you can add?
I don't have a lot of DB background, but I am working with databases a lot recently, and I've been finding that a lot of the queries can be improved just by adding indexes.
We are using DB2, and there are commands called db2expln and db2advis: the first will indicate whether table scans or index scans are being used, and the second will recommend indexes you can add to improve performance. I'm sure MySQL has similar tools...
Anyways, if this is something you haven't considered yet, it has been helping me a lot... but if you've already gone this route, then I guess it's not what you are looking for.
Another possibility is a "materialized view" (or as they call it in DB2), which lets you specify a table that is essentially built of parts from multiple tables. Thus, rather than normalizing the actual columns, you could provide this view to access the data... but I don't know if this has severe performance impacts on inserts/updates/deletes (but if it is "materialized", then it should help with selects since the values are physically stored separately).
In line with some of the other comments, I would definitely have a look at your indexing.
One thing I discovered earlier this year on our MySQL databases was the power of composite indexes. For example, if you are reporting on order numbers over date ranges, a composite index on the order number and order date columns could help. I believe MySQL generally uses only one index per table in a query, so if you just had separate indexes on the order number and order date it would have to decide on just one of them to use. Using the EXPLAIN command can help determine this.
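The order number and date range example can be sketched as below. This is a hedged illustration only: Python's sqlite3 module and SQLite's EXPLAIN QUERY PLAN stand in for MySQL and its EXPLAIN, and the orders table and the index name idx_orders_number_date are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        order_number INTEGER NOT NULL,
        order_date TEXT NOT NULL,
        total REAL NOT NULL
    );
    -- One composite index: the equality column first, the range column second.
    CREATE INDEX idx_orders_number_date ON orders (order_number, order_date);
""")

# Ask the planner how it would run a query that filters on both columns.
plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT total FROM orders "
    "WHERE order_number = ? AND order_date BETWEEN ? AND ?",
    (1001, "2024-01-01", "2024-01-31"),
).fetchall()

# The last column of each plan row is the human-readable description.
plan = " | ".join(row[-1] for row in plan_rows)
print(plan)
```

The printed plan names the composite index, meaning a single index lookup narrows the rows by both conditions instead of the planner having to choose between two separate single-column indexes.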
To give an indication of the performance with good indexes (including numerous composite indexes), I can run queries joining 3 tables in our database and get almost instant results in most cases. For more complex reporting most of the queries run in under 10 seconds. These 3 tables have 33 million, 110 million and 140 million rows respectively. Note that we had also already normalised these slightly to speed up our most common query on the database.
More information regarding your tables and the types of reporting queries may allow further suggestions.
For MySQL I like this talk: Real World Web: Performance & Scalability, MySQL Edition. This contains a lot of different pieces of advice for getting more speed out of MySQL.
You might also want to consider selecting into a temporary table and then performing queries on that temporary table. This would avoid the need to rejoin your tables for every single query you issue (assuming that you can use the temporary table for numerous queries, of course). This basically gives you denormalized data, but if you are only doing select calls, there's no concern about data consistency.
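Selecting into a temporary table once and then running several report queries against it might look like this sketch. It uses Python's sqlite3 module as a stand-in for MySQL, and the tables customers, orders, and report_rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        total REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'north'), (2, 'south');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 30.0), (3, 2, 5.0);

    -- Do the join once; the temporary table lives only for this connection.
    CREATE TEMPORARY TABLE report_rows AS
        SELECT o.id AS order_id, c.region, o.total
        FROM orders o
        JOIN customers c ON c.id = o.customer_id;
""")

# Several report queries now read the pre-joined rows without rejoining.
by_region = conn.execute(
    "SELECT region, SUM(total) FROM report_rows GROUP BY region ORDER BY region"
).fetchall()
largest = conn.execute(
    "SELECT order_id, total FROM report_rows ORDER BY total DESC LIMIT 1"
).fetchone()
```

Because the temporary table disappears with the connection, there is no stale copy to invalidate afterwards, which is why consistency is less of a concern here than with a permanent denormalized table.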
Further to my previous answer, another approach we have taken in some situations is to store key reporting data in separate summary tables. There are certain reporting queries which are just going to be slow even after denormalising and optimisations and we found that creating a table and storing running totals or summary information throughout the month as it came in made the end of month reporting much quicker as well.
We found this approach easy to implement as it didn't break anything that was already working - it's just additional database inserts at certain points.
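Keeping running totals as data comes in could be sketched as follows. This is only an illustration: Python's sqlite3 module stands in for MySQL, and the tables orders and monthly_sales and the function record_order are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        order_date TEXT NOT NULL,  -- ISO date, e.g. 2024-01-05
        total REAL NOT NULL
    );
    -- Summary table: one row per month, updated as orders arrive.
    CREATE TABLE monthly_sales (
        month TEXT PRIMARY KEY,
        order_count INTEGER NOT NULL,
        revenue REAL NOT NULL
    );
""")


def record_order(order_date, total):
    """Insert the order and bump that month's running totals together."""
    month = order_date[:7]  # 'YYYY-MM'
    with conn:  # both writes commit or roll back as one unit
        conn.execute(
            "INSERT INTO orders (order_date, total) VALUES (?, ?)",
            (order_date, total),
        )
        cur = conn.execute(
            "UPDATE monthly_sales "
            "SET order_count = order_count + 1, revenue = revenue + ? "
            "WHERE month = ?",
            (total, month),
        )
        if cur.rowcount == 0:  # first order of the month: create its row
            conn.execute(
                "INSERT INTO monthly_sales VALUES (?, 1, ?)", (month, total)
            )


record_order("2024-01-05", 10.0)
record_order("2024-01-20", 2.5)
record_order("2024-02-01", 4.0)
summary = conn.execute(
    "SELECT month, order_count, revenue FROM monthly_sales ORDER BY month"
).fetchall()
```

The end-of-month report then reads a handful of rows from monthly_sales instead of aggregating the full orders table, and the existing insert path only gains one extra statement.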
I've been toying with composite indexes and have seen some real benefits... maybe I'll set up some tests to see if that can save me here, at least for a little longer.