How bad is it to change the structure of big tables? - mysql

My target DB is MySQL, but let's try to look at this problem more generically. I am about to create a framework that will modify table structures as needed. Tables may contain hundreds of thousands of records some day. I might add a column, rename a column, or even change the type of a column (let's assume nothing impossible, i.e. I might only change a numeric column into a varchar column), or I might also drop a whole column. But in most cases I would just add columns.
Somebody told me that's a very bad idea. He said that very loud and clear, but without any further argument. So what do you think? Bad idea? Why?

In most databases, adding and renaming columns are simple operations that just change the table metadata. You should verify that this holds for the MySQL storage engine you're using, though. Dropping a column should be lightweight too.
By contrast, changing the type of a column is an intensive operation, since it involves actually rewriting the data for each row in the table. The same goes for adding a column and populating it (rather than leaving the new column's values as NULL).
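For instance, on MySQL 8.0+ with InnoDB you can ask the server to refuse anything more expensive than a metadata change (a minimal sketch; the table and column names are hypothetical):

    -- Succeeds only if the change is a pure metadata operation;
    -- errors out instead of silently rebuilding the table.
    ALTER TABLE orders
        ADD COLUMN note VARCHAR(255),
        ALGORITHM=INSTANT;

    -- A type change cannot be done in place, so it forces a full rebuild:
    ALTER TABLE orders
        MODIFY COLUMN amount VARCHAR(20),
        ALGORITHM=COPY;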

It is a bad idea.
You will have client applications using these columns, so whenever you rename or remove any columns, you will break their code.
Changing the type of a column might be necessary sometimes, but that's not something you will have to do often. If the column is numeric and you do calculations with it, then you can't change it to varchar. And if you don't need it for any calculations and you expect non-numeric characters later, why not define it as varchar in the first place?

Databases are built to do some things very well: to handle data integrity, referential integrity and transactional integrity, to name just three. Messing with table structures (when they have existing data and relationships) is probably violating all three of those things simultaneously.
If you want to adjust the way your data is presented to the user (limiting visibility of columns, renaming columns etc.), you are much better off doing this by exposing your data via views and altering their definition as needed. But leave the tables alone!
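A minimal sketch of the view approach (all names are made up): the view renames and hides columns while the base table keeps its structure:

    -- Clients query the view; the base table stays stable.
    CREATE OR REPLACE VIEW customer_v AS
    SELECT id,
           cust_nm AS customer_name   -- "rename" a column without touching the table
    FROM   customer;                  -- internal columns are simply omitted

When requirements change, you alter the view definition, and client code that reads from customer_v keeps working.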

It is very harmful and should be avoided: the integrity of your data can be disturbed, and your queries will also fail in case you have hardcoded queries or an improperly configured ORM.
If you do it anyway, be cautious, and if the database is live, take a backup first.

Related

Use many tables with few columns or few tables with many columns in MySQL?

I'm setting up a table that might have upwards of 70 columns. I'm now thinking about splitting it up as some of the data in the columns won't be needed every time the table is accessed. Then again, if I do this I'm left with having to use joins.
At what point, if any, is it considered too many columns?
It's considered too many once it's above the maximum limit supported by the database.
The fact that you don't need every column to be returned by every query is perfectly normal; that's why the SELECT statement lets you explicitly name the columns you need.
As a general rule, your table structure should reflect your domain model; if you really do have 70 (100, what have you) attributes that belong to the same entity there's no reason to separate them into multiple tables.
There are some benefits to splitting up the table into several with fewer columns, which is also called Vertical Partitioning. Here are a few (a sketch of such a split follows the list):
If you have tables with many rows, modifying the indexes can take a very long time, as MySQL needs to rebuild all of the indexes in the table. Having the indexes split over several tables could make that faster.
Depending on your queries and column types, MySQL could be writing temporary tables (used in more complex SELECT queries) to disk. This is bad, as disk I/O can be a big bottleneck. This occurs if you have binary data (TEXT or BLOB) in the query.
Wider tables can lead to slower query performance.
Don't prematurely optimize, but in some cases, you can get improvements from narrower tables.
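As a rough sketch of such a split (hypothetical names), the rarely-needed wide columns move into a 1:1 side table:

    -- Hot, frequently-read columns stay in the main table...
    CREATE TABLE article (
        id         INT PRIMARY KEY,
        title      VARCHAR(200),
        created_at DATETIME
    );

    -- ...while rarely-needed wide columns move to a 1:1 side table.
    CREATE TABLE article_body (
        id   INT PRIMARY KEY,
        body MEDIUMTEXT,
        FOREIGN KEY (id) REFERENCES article(id)
    );

    -- Join only when the wide columns are actually needed.
    SELECT a.title, b.body
    FROM article a
    JOIN article_body b ON b.id = a.id
    WHERE a.id = 42;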
It is too many when it violates the rules of normalization. It is pretty hard to get that many columns if you are normalizing your database. Design your database to model the problem, not around any artificial rules or ideas about optimizing for a specific db platform.
Apply the following rules to the wide table and you will likely have far fewer columns in a single table (a small example follows the list).
No repeating elements or groups of elements
No partial dependencies on a concatenated key
No dependencies on non-key attributes
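For example, the first rule alone can shrink a wide table considerably (a hedged sketch with made-up names):

    -- Repeating group: phone1..phone3 violate the first rule above.
    CREATE TABLE contact_flat (
        id     INT PRIMARY KEY,
        name   VARCHAR(100),
        phone1 VARCHAR(20),
        phone2 VARCHAR(20),
        phone3 VARCHAR(20)
    );

    -- Normalized: one row per phone number, arbitrarily many per contact.
    CREATE TABLE contact (
        id   INT PRIMARY KEY,
        name VARCHAR(100)
    );

    CREATE TABLE contact_phone (
        contact_id INT,
        phone      VARCHAR(20),
        PRIMARY KEY (contact_id, phone),
        FOREIGN KEY (contact_id) REFERENCES contact(id)
    );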
That's not a problem as long as all attributes belong to the same entity and do not depend on each other.
To make life easier you can have one text column with a JSON array stored in it, provided you don't have a problem with fetching all the attributes every time. However, this would largely defeat the purpose of storing the data in an RDBMS and would greatly complicate every database transaction, so it's not a recommended approach to follow throughout the database.
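A minimal sketch, assuming MySQL 5.7+ with its native JSON type (table name and paths are hypothetical):

    CREATE TABLE product (
        id    INT PRIMARY KEY,
        name  VARCHAR(100),
        attrs JSON            -- catch-all for rarely-queried attributes
    );

    INSERT INTO product VALUES (1, 'widget', '{"color": "red", "weight_g": 120}');

    -- Extraction works, but filtering on it won't use an ordinary index
    -- (indexed generated columns can help if that becomes necessary).
    SELECT name, attrs->>'$.color' AS color FROM product;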
Having too many columns in the same table can cause huge problems in replication as well. You should know that changes that happen on the master will replicate to the slave; for example, if you update one field in the table, with row-based replication the whole row will be written to the binary log and replicated.

Alternatives to mysql for large reference tables

We currently use MySQL for two types of tables:
The first set are the typical transaction based tables.
The second are tables that store historical data, which is usually written once and read many times. They are large, hundreds of millions of rows or larger, and have a couple of indexes.
We have a couple of issues with these tables.
Any schema changes take forever
We’re not comfortable with the whole table being a single point of failure. If anything goes wrong, rebuilding this table would take ages.
It doesn't seem scalable
Are there any features of mysql we are missing that would alleviate these issues? I saw that MariaDB now has a way to add columns that doesn’t lock the whole table, but it doesn’t solve the other issues.
We’re also open to other products that might solve the issue. Any ideas?
Why would you ever need to add columns to historical data? Anyway, what values would you assign to the 'old' rows?
An alternative to adding a column is to create a "parallel" table (aka "vertical partitioning"). The new table would have the same PRIMARY KEY as the original (except for any AUTO_INCREMENT declaration). You would use LEFT JOIN to fetch columns from both tables, and understand that 'old' rows would give you NULLs for the 'new' columns.
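A sketch of such a parallel table (all names hypothetical):

    -- Shares the PK values of the original table; no AUTO_INCREMENT here.
    CREATE TABLE history_extra (
        id      BIGINT PRIMARY KEY,   -- same values as history.id
        new_col VARCHAR(50)
    );

    -- Old rows simply come back with NULL for the new column.
    SELECT h.*, e.new_col
    FROM history h
    LEFT JOIN history_extra e ON e.id = h.id;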
Another useful thing to do for historical data is to treat it like a Fact table in data warehousing: build and maintain "summary table(s)" to significantly speed up common "report" type queries.
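For example (hypothetical names and columns), a daily rollup lets report queries scan a handful of rows instead of the full fact table:

    CREATE TABLE history_daily (
        day       DATE,
        metric    VARCHAR(50),
        row_count INT,
        total     DECIMAL(18,2),
        PRIMARY KEY (day, metric)
    );

    -- Run once per day to fold in yesterday's facts.
    INSERT INTO history_daily
    SELECT DATE(created_at), metric, COUNT(*), SUM(amount)
    FROM history
    WHERE created_at >= CURDATE() - INTERVAL 1 DAY
      AND created_at <  CURDATE()
    GROUP BY DATE(created_at), metric;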
In newer versions of MySQL/MariaDB, ALTER TABLE ... ADD COLUMN ... ALGORITHM=INPLACE removes most of the performance pain.
Adding columns is also solved by moving toward an EAV schema, which has a lot of bad qualities. So, move only part-way toward it: keep the 5-10 main columns that you use for filtering and sorting as real columns, then put the rest of the key-value data into a JSON column. Both MySQL and MariaDB have such a column type (though with some differences), plus MariaDB has "Dynamic Columns".
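A hedged sketch of that hybrid layout, assuming MySQL 5.7+ or MariaDB 10.2+ (all names hypothetical):

    -- Real columns for filtering/sorting; everything else in one JSON blob.
    CREATE TABLE history_hybrid (
        id         BIGINT PRIMARY KEY,
        user_id    BIGINT,
        created_at DATETIME,
        status     TINYINT,
        extra      JSON,
        KEY (user_id, created_at)
    );

    -- Filter on real, indexed columns; pull rare attributes out of the JSON.
    SELECT id, extra->>'$.source'
    FROM history_hybrid
    WHERE user_id = 123 AND created_at >= '2020-01-01';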
"but it doesn’t solve the other issues" -- such as??

Strategies for large databases with changing schemas

We have a MySQL database table with hundreds of millions of rows. We run into issues performing any kind of operation on it. For example, adding columns has become impossible to do within any kind of predictable time frame. When we want to roll out a new column, the ALTER TABLE command takes forever, so we don't have a good idea of what the maintenance window is.
We're not tied to keeping this data in MySQL, but I was wondering if there are strategies, for MySQL or databases in general, for updating schemas on large tables.
One idea, which I don't particularly like, would be to create a new table with the old schema plus the additional column, and run queries against a view that unions the results until all data can be moved to the new table schema.
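For what it's worth, a rough sketch of that idea (hypothetical names; each leg of the UNION must expose the same column list, so the old table pads the new column with NULL):

    CREATE TABLE big_v2 LIKE big;                -- copy the old structure...
    ALTER TABLE big_v2 ADD COLUMN new_col INT;   -- ...plus the new column

    CREATE OR REPLACE VIEW big_all AS
    SELECT id, a, b, NULL AS new_col FROM big    -- not-yet-migrated rows
    UNION ALL
    SELECT id, a, b, new_col FROM big_v2;        -- migrated rows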
Right now we already run into issues where deleting large numbers of records based on a WHERE clause exits with an error.
Ideas?
In MySQL, you can create a new table using an entity-attribute-value model. This would have one row per entity and attribute, rather than putting the attribute in a new column.
This is particularly useful for sparse data. Words of caution: types are problematic (everything tends to get turned into strings) and you cannot define foreign key relationships.
EAV models are particularly useful for sparse values -- when you have attributes that only apply to a small number of rows. They may prove useful in your case.
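A minimal EAV sketch (hypothetical names), including the string-typing caveat mentioned above:

    CREATE TABLE entity_attr (
        entity_id BIGINT,
        attr      VARCHAR(64),
        value     VARCHAR(255),   -- types collapse to strings, as warned above
        PRIMARY KEY (entity_id, attr)
    );

    -- "Adding a column" is now just inserting rows:
    INSERT INTO entity_attr VALUES (42, 'color', 'red'), (42, 'weight_g', '120');

    -- Reassembling a row requires a pivot (or one join) per attribute:
    SELECT entity_id,
           MAX(CASE WHEN attr = 'color'    THEN value END) AS color,
           MAX(CASE WHEN attr = 'weight_g' THEN value END) AS weight_g
    FROM entity_attr
    GROUP BY entity_id;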
In NoSQL data models, adding new attributes or lists of attributes is simpler. However, there is no relationship to the attributes in other rows.
Columnar databases (at least the one in MariaDB) are very frugal with space -- some say 10x smaller than InnoDB. The shrinkage, alone, may be well worth it for 100M rows.
You have not explained whether your data is sparse. If it is, JSON is not that costly in space -- you completely leave out any 'fields' that are missing, at zero space. With almost any other approach, there is at least some overhead for missing cells.
As you suggest, use regular columns for common fields. But also for the main fields that you are likely to search on. Then throw the rest into JSON.
I like to compress (in the client) the JSON string and store it in a BLOB. This gives about 3x shrinkage over uncompressed TEXT.
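The compression there happens in the client; as a rough server-side stand-in, MySQL's built-in COMPRESS()/UNCOMPRESS() illustrate the same idea (names hypothetical, and note that client-side compression also saves network bandwidth, which this does not):

    CREATE TABLE doc (
        id    BIGINT PRIMARY KEY,
        attrs BLOB                  -- compressed JSON, not a TEXT column
    );

    INSERT INTO doc VALUES (1, COMPRESS('{"color": "red", "weight_g": 120}'));

    SELECT UNCOMPRESS(attrs) FROM doc WHERE id = 1;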
I dislike the one-row per attribute EAV approach; it is very costly in space, JOINs, etc, etc.
Do avoid ALTER whenever possible.

MYSQL - int or short string?

I'm going to create a table which will have between 1,000 and 20,000 rows, and it will have a field whose values repeat a lot: about 60% of the rows will share the most common value, and beyond that roughly every 50-100 rows share a value.
I've been concerned about efficiency lately, and I'm wondering whether it would be better to store this string in each row (it would be 8-20 characters) or to create another table and link to it by a representative ID instead -- so having ~1-50 rows in that table replace about 300-5,000 strings with ints?
Is this a good approach, or even necessary at all?
Yes, it's a good approach in most circumstances. It's called normalisation, and is mainly done for two reasons:
Removing repeated data
Avoiding repeating entities
I can't tell from your question what the reason would be in your case.
The difference between the two is that the first reuses values that just happen to look the same, while the second connects values that have the same meaning. The practical difference is in what should happen if a value changes, i.e. if the value changes for one record, should the value itself change so that it changes for all other records also using it, or should that record be connected to a new value so that the other records are left unchanged.
If it's for the first reason then you will save space in the database, but it will be more complicated to update records. If it's for the second reason you will not only save space, but you will also reduce the risk of inconsistency, as a value is only stored in one place.
That is a good approach: have a look-up table for the strings. That way you can build more efficient indexes on the integer values. It wouldn't be absolutely necessary, but as a good practice I would do it.
I would recommend using an int with a foreign key to a lookup table (like you describe in your second scenario). This will make the index much smaller than indexing a VARCHAR, so the storage required will be smaller. It should perform better, too.
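A minimal sketch of that layout (names hypothetical): the 8-20 character string is stored once, and each row carries only a 2-byte id:

    CREATE TABLE category (
        id   SMALLINT AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(20) NOT NULL UNIQUE
    );

    CREATE TABLE item (
        id          INT AUTO_INCREMENT PRIMARY KEY,
        category_id SMALLINT NOT NULL,
        FOREIGN KEY (category_id) REFERENCES category(id)
    );

    -- The string comes back via a cheap join on the small lookup table.
    SELECT i.id, c.name
    FROM item i
    JOIN category c ON c.id = i.category_id;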
Avitus is right, that it's generally a good practice to create lookups.
Think about the JOINs you will use this table in. 1,000-20,000 rows are not a lot for MySQL to handle. If you don't have any joins, I would not bother with the lookups; just index the column.
BUT as soon as you start joining the table with others (of the same size), that's where the performance loss comes in, and you can (most likely) compensate for it by introducing lookups.

Should one steer clear of adding yet another field to a larger MySQL table?

I have a MySQL InnoDB table with 350,000+ rows, containing a couple of things like id, otherId, shortTitle and so on. Now I need a bool/bit field for perhaps a couple of hundred or thousand of those rows. Should I just add that bool field to the table, or should I create a new table referencing the IDs of the old table -- thereby not risking performance issues in all the existing functions that access the first table?
(Side info: I never use "SELECT * ...". The main table sees lots of reads and few writes.)
Adding a field can indeed hamper performance a little, since your table rows grow larger, but that's hardly a problem for a BIT field.
Most probably, you will have exactly the same row count per page, which means no performance decrease at all.
On the other hand, using an extra JOIN to access the row value in another table will be much slower.
I'd add the column right into the table.
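In concrete terms that is a one-liner (table and column names hypothetical; TINYINT(1) is MySQL's conventional boolean):

    -- A NOT NULL default keeps existing queries and inserts working unchanged.
    ALTER TABLE main_table
        ADD COLUMN is_flagged TINYINT(1) NOT NULL DEFAULT 0;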
What does the new column denote?
From the data modelling perspective, if the column belongs with the data under whichever normal form is in use, then put it with the data; performance impact be damned. If the column doesn't directly belong to the table, then put it in a second table with a foreign key.
Realistically, the performance impact of adding a new column to a table with ~350,000 rows isn't going to be particularly huge. Have you tried issuing the ALTER TABLE statement against a copy, perhaps on a local workstation?
I don't know why people insist on calling 350K-row tables big. In the mainframe world, that's how big the DBMS configuration tables are :-).
That said, you should be designing your tables in third normal form. If, and only if, you have performance problems should you consider de-normalizing.
If you have a column that will apply to only some of the rows, it's (probably) not 3NF to put it in the same table. You should have a separate table with a foreign key into your 'primary' table.
Keep in mind that's only if the boolean field genuinely doesn't apply to some of the rows. That's a different situation from the field applying to all rows but simply not being known for some; in that case, a nullable column in the primary table would be better. But that doesn't sound like what you're describing.
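A sketch of that separate-table variant (hypothetical names): only the rows the flag actually applies to get a row in the side table:

    CREATE TABLE main_flag (
        main_id INT PRIMARY KEY,
        flagged TINYINT(1) NOT NULL,
        FOREIGN KEY (main_id) REFERENCES main_table(id)
    );

    -- Absence of a row means "does not apply", not "false/unknown".
    SELECT m.id, f.flagged
    FROM main_table m
    LEFT JOIN main_flag f ON f.main_id = m.id;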
Requiring a bit field for only the newer entries sounds like you want to implement inheritance. If that is the case, I would add it to a new table to keep things readable. Otherwise, it doesn't matter whether you add it to the main table or not, unless your queries are not using indexes, in which case I would fix that first before making any other decisions regarding performance.