Alternatives to MySQL for large reference tables

We currently use MySQL for two types of tables:
The first set are the typical transaction-based tables.
The second are tables that store historical data, which is usually written once and read many times. They are large, hundreds of millions of rows or larger, and have a couple of indexes.
We have a few issues with these tables:
Any schema changes take forever
We’re not comfortable with the whole table being a single point of failure. If anything goes wrong, rebuilding this table would take ages.
It doesn't seem scalable
Are there any features of MySQL we are missing that would alleviate these issues? I saw that MariaDB now has a way to add columns that doesn't lock the whole table, but it doesn't solve the other issues.
We’re also open to other products that might solve the issue. Any ideas?

Why would you ever need to add columns to Historical data? Anyway, what values would you assign to the 'old' rows?
An alternative to adding a column is to create a "parallel" table (aka "vertical partitioning"). The new table would have the same PRIMARY KEY as the original (except for any AUTO_INCREMENT declaration). You would use LEFT JOIN to fetch columns from both tables, and understand that 'old' rows would give you NULLs for the 'new' columns.
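As a rough sketch (the table and column names here are made up), the parallel table and the combined fetch might look like:

    -- Parallel table sharing the PRIMARY KEY of the original `history` table
    CREATE TABLE history_extra (
        id      BIGINT UNSIGNED NOT NULL,  -- same value as history.id, no AUTO_INCREMENT
        new_col VARCHAR(64) NULL,          -- the column you would otherwise have ALTERed in
        PRIMARY KEY (id)
    );

    -- 'Old' rows have no match in history_extra, so new_col comes back NULL
    SELECT h.id, h.logged_at, x.new_col
    FROM history AS h
    LEFT JOIN history_extra AS x USING (id);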
Another useful thing to do for Historical data is to treat it like a Fact table in Data Warehousing: build and maintain "Summary table(s)" to significantly speed up common "report"-type queries.
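For instance, assuming a hypothetical fact table history(event_date, customer_id, amount), a daily summary could be maintained like this:

    CREATE TABLE history_daily (
        event_date  DATE NOT NULL,
        customer_id INT NOT NULL,
        row_count   INT NOT NULL,
        total_amt   DECIMAL(14,2) NOT NULL,
        PRIMARY KEY (event_date, customer_id)
    );

    -- Run nightly for the previous day; reports then hit this small table
    -- instead of scanning hundreds of millions of fact rows.
    INSERT INTO history_daily
    SELECT event_date, customer_id, COUNT(*), SUM(amount)
    FROM history
    WHERE event_date = CURDATE() - INTERVAL 1 DAY
    GROUP BY event_date, customer_id;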
In newer versions of MySQL/MariaDB, ALTER TABLE ... ADD COLUMN ... ALGORITHM=INPLACE removes most of the performance pain.
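For example (hypothetical table name; exact support depends on version and storage engine, and the statement errors out rather than silently copying the table if INPLACE is unavailable):

    ALTER TABLE history
        ADD COLUMN note VARCHAR(64) NULL,
        ALGORITHM=INPLACE, LOCK=NONE;

MySQL 8.0.12+ also offers ALGORITHM=INSTANT for ADD COLUMN, which avoids touching existing rows entirely.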
Adding columns is also solved by moving toward an EAV schema, which has a lot of bad qualities. So, move only part-way toward it. That is, keep the 5-10 main columns that you use for filtering and sorting as real columns, then put the rest of the key-value junk into a JSON column. Both MySQL and MariaDB have such (though with some differences), plus MariaDB has "Dynamic Columns".
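A sketch of that hybrid layout (invented names; JSON_UNQUOTE/JSON_EXTRACT work in both MySQL 5.7+ and MariaDB 10.2+):

    CREATE TABLE history (
        id          BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        event_date  DATE NOT NULL,     -- real columns: things you filter/sort on
        customer_id INT NOT NULL,
        attrs       JSON NULL,         -- everything else as key-value pairs
        INDEX (customer_id, event_date)
    );

    -- Pull an occasional attribute out of the JSON when needed
    SELECT id, JSON_UNQUOTE(JSON_EXTRACT(attrs, '$.color')) AS color
    FROM history
    WHERE customer_id = 42;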
Summary tables
EAV
"but it doesn’t solve the other issues" -- such as??

Related

Alternatives to ALTER TABLE in MySQL

We use MySQL (AWS Aurora) to store data for our online payment transactions. One of our tables, in which each row stores information about a particular transaction, has more than 1 billion rows.
How can I go about adding a new attribute for a transaction? Altering this table is not possible because of the large amount of time required to do so.
The only possible solution seems to be creating a new table which stores key-value pairs for each transaction. Are there other, more efficient ways to do this, assuming altering the table structure is not possible?
An alternative is to create a parallel table. It would have the same PRIMARY KEY as your current table (but without AUTO_INCREMENT). And it would have the 'new' column(s).
Then you would JOIN on the PK to fetch both old and new columns at the same time.
Pros: No downtime, no big ALTER, etc.
Cons: Now the table is split in two. Subsequent columns being added go through the same dilemma.
Alternative to the alternative: Put a JSON column in that new table.
Pros: Very open-ended wrt adding more columns.
Cons: Can't index it very well. (This depends on what version you are using.)
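On MySQL 5.7+, one workaround for the indexing limitation is a generated column over the JSON (illustrative names; a VIRTUAL column can be added without rewriting existing rows, though building its index still takes a pass over the table):

    ALTER TABLE txn_extra
        ADD COLUMN risk_score INT
            AS (JSON_EXTRACT(attrs, '$.risk_score')) VIRTUAL,
        ADD INDEX (risk_score);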
At my work, we have quite a few tables with over 1 billion rows. Developers add or remove columns, change data types, add or remove indexes, etc. Any kind of ALTER TABLE.
The way we do this is to use pt-online-schema-change, a free tool available from Percona. It allows you to do long-running schema changes, and you can still read and write the table while it's doing the change in the background.
It still takes a long time to do a change to a large table. In the largest cases, it takes weeks. But it doesn't block your work in the meantime.
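For reference, a typical invocation looks something like this (the database, table, and column names are placeholders; running with --dry-run first is a good habit):

    pt-online-schema-change \
        --alter "ADD COLUMN channel VARCHAR(32) NULL" \
        D=payments,t=transactions \
        --dry-run    # swap in --execute to actually perform the change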

Use many tables with few columns or few tables with many columns in MySQL?

I'm setting up a table that might have upwards of 70 columns. I'm now thinking about splitting it up as some of the data in the columns won't be needed every time the table is accessed. Then again, if I do this I'm left with having to use joins.
At what point, if any, is it considered too many columns?
It's considered too many once it's above the maximum limit supported by the database.
The fact that you don't need every column to be returned by every query is perfectly normal; that's why the SELECT statement lets you explicitly name the columns you need.
As a general rule, your table structure should reflect your domain model; if you really do have 70 (100, what have you) attributes that belong to the same entity there's no reason to separate them into multiple tables.
There are some benefits to splitting up the table into several with fewer columns, which is also called Vertical Partitioning. Here are a few:
If you have tables with many rows, modifying the indexes can take a very long time, as MySQL needs to rebuild all of the indexes in the table. Having the indexes split over several tables could make that faster.
Depending on your queries and column types, MySQL could be writing temporary tables (used in more complex SELECT queries) to disk. This is bad, as disk I/O can be a big bottleneck. This occurs if you have binary data (TEXT or BLOB) in the query.
Wider tables can lead to slower query performance.
Don't prematurely optimize, but in some cases, you can get improvements from narrower tables.
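As a sketch of such a split (hypothetical names), the bulky, rarely-read columns move into a 1:1 side table sharing the primary key:

    CREATE TABLE article (
        id           INT AUTO_INCREMENT PRIMARY KEY,
        title        VARCHAR(200) NOT NULL,
        published_at DATETIME NOT NULL
    );

    CREATE TABLE article_body (
        id   INT NOT NULL PRIMARY KEY,
        body MEDIUMTEXT NOT NULL,
        FOREIGN KEY (id) REFERENCES article (id)
    );

    -- Listing pages never touch the TEXT column, so no on-disk temp tables
    SELECT id, title FROM article ORDER BY published_at DESC LIMIT 20;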
It is too many when it violates the rules of normalization. It is pretty hard to get that many columns if you are normalizing your database. Design your database to model the problem, not around any artificial rules or ideas about optimizing for a specific db platform.
Apply the following rules to the wide table and you will likely have far fewer columns in a single table.
No repeating elements or groups of elements
No partial dependencies on a concatenated key
No dependencies on non-key attributes
That's not a problem as long as all the attributes belong to the same entity and do not depend on each other.
To make life easier you can have one text column with a JSON array stored in it -- provided you don't have a problem with getting all the attributes every time. However, this would largely defeat the purpose of storing the data in an RDBMS and would greatly complicate every database transaction, so it's not a recommended approach to follow throughout the database.
Having too many columns in the same table can cause huge problems in replication as well. You should know that the changes that happen on the master will replicate to the slave. For example, if you update one field in the table, with row-based replication the whole row image is written to the binary log and sent to the slave, so wide rows make even small updates expensive to replicate.

Strategies for large databases with changing schemas

We have a MySQL database table with hundreds of millions of rows. We run into issues performing any kind of operation on it. For example, adding columns is becoming impossible to do within any kind of predictable time frame. When we want to roll out a new column, the ALTER TABLE command takes forever, so we don't have a good idea of what the maintenance window is.
We're not tied to keeping this data in MySQL, but I was wondering if there are strategies, for MySQL or databases in general, for updating schemas on large tables.
One idea, which I don't particularly like, would be to create a new table with the old schema plus the additional column, and run queries against a view that UNIONs the results until all the data could be moved to the new table's schema.
Right now we already run into issues where deleting large numbers of records based on a WHERE clause exits with an error.
Ideas?
In MySQL, you can create a new table using an entity-attribute-value model. This would have one row per entity and attribute, rather than putting the attribute in a new column.
This is particularly useful for sparse data. Words of caution: types are problematic (everything tends to get turned into strings) and you cannot define foreign key relationships.
EAV models are particularly useful for sparse values -- when you have attributes that only apply to a small number of rows. They may prove useful in your case.
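A bare-bones EAV layout looks something like this (illustrative names; note how every value degenerates into a string):

    CREATE TABLE entity_attributes (
        entity_id BIGINT NOT NULL,        -- points at the main table's row
        attr      VARCHAR(64) NOT NULL,   -- attribute name
        value     VARCHAR(255) NULL,      -- ints, dates, etc. all stored as text
        PRIMARY KEY (entity_id, attr)
    );

    -- Sparse by construction: only entities that have 'color' store a row for it
    SELECT value
    FROM entity_attributes
    WHERE entity_id = 42 AND attr = 'color';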
In NoSQL data models, adding new attributes or lists of attributes is simpler. However, there is no relationship to the attributes in other rows.
Columnar databases (at least the one in MariaDB) are very frugal on space -- some say 10x smaller than InnoDB. The shrinkage alone may be well worth it for 100M rows.
You have not explained whether your data is sparse. If it is, JSON is not that costly space-wise -- any missing 'fields' are simply left out, taking zero space. With almost any other approach, there is at least some overhead for missing cells.
As you suggest, use regular columns for common fields. But also for the main fields that you are likely to search on. Then throw the rest into JSON.
I like to compress (in the client) the JSON string and use a BLOB. This gives about 3x shrinkage over uncompressed TEXT.
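If compressing in the client is inconvenient, MySQL's built-in COMPRESS()/UNCOMPRESS() functions approximate the idea server-side (table and column names invented), at the cost of server CPU and shipping the uncompressed string over the wire:

    -- attrs_blob is a BLOB column holding the compressed JSON string
    UPDATE history
        SET attrs_blob = COMPRESS('{"color":"red","size":10}')
        WHERE id = 42;

    SELECT UNCOMPRESS(attrs_blob) FROM history WHERE id = 42;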
I dislike the one-row-per-attribute EAV approach; it is very costly in space, JOINs, etc.
More thoughts on EAV.
Do avoid ALTER whenever possible.

MySQL Best Practice for adding columns

So I started working for a company where they had 3 to 5 different tables that were often queried either in a complex join or through double or triple queries (I'm probably the 4th person to start working here; it's very messy).
Anyhow, I created a table that is populated by querying those 3 to 5 tables at the same time and inserting the combined data into it, along with whatever information normally got inserted there. It has drastically sped up page load times for many applications, and I'm wondering if I made a mistake here.
I'm hoping in the future to stop inserting into those other tables, insert all of that information into the table I've created, and switch the applications over to that one table. It's just a lot faster.
Could someone tell me why it's much faster to group all the information into one massive table and if there is any downside to doing it this way?
If the joins are slow, it may be because the tables did not have FOREIGN KEY relationships and indexes properly defined. If the tables had been properly normalized before, it is probably not a good idea to denormalize them into a single table unless they were not performant with proper indexing. FOREIGN KEY constraints require indexing on both the PK table and the related FK column, so simply defining those constraints if they don't already exist may go a long way toward improving performance.
The first course of action is to make sure the table relationships are defined correctly and the tables are indexed, before you begin denormalizing it.
There is a concept called materialized views, which serve as a sort of cache for views or queries whose result sets are deterministic, by storing the results of a view's query into a temporary table. MySQL does not support materialized views directly, but you can implement them by occasionally selecting all rows from a multi-table query and storing the output into a table. When the data in that table is stale, you overwrite it with a new rowset. For simple SELECT queries which are used to display data that doesn't change often, you may be able to speed up your pageloads using this method. It is not advisable to use it for data which is constantly changing though.
A good use for materialized views might be constructing rows to populate your site's dropdown lists or to store the result of complicated reports which are only run once a week. A bad use for them would be to store customer order information, which requires timely access.
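A common way to emulate a materialized view in MySQL is a periodic rebuild followed by an atomic RENAME swap (the table and column names below are placeholders):

    -- Build the fresh copy off to the side
    CREATE TABLE report_cache_new AS
    SELECT c.id, c.name, COUNT(o.id) AS order_count
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name;

    -- Swap it in atomically; readers never see a half-built table
    RENAME TABLE report_cache TO report_cache_old,
                 report_cache_new TO report_cache;
    DROP TABLE report_cache_old;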
Without seeing the table structures, etc., it would be guesswork. But it sounds like possibly the database was over-normalized.
It is hard to say exactly what the issue is without seeing it. But you might want to look at adding indexes, and foreign keys to the tables.
If you are adding a table with all of the data in it, you might be denormalizing the database.
There are some cases where de-normalizing your tables has its advantages, but I would be more interested in finding out if the problem really lies with the table schema or with how the queries are being written. You need to know if the queries utilize indexes (or whether indexes need to be added to the table), whether the original query writer did things like using subselects when they could have been using joins to make a query more efficient, etc.
I would not just denormalize because it makes things faster unless there is a good reason for it.
Having a separate copy of the data in your newly defined table is a valid performance-enhancing practice, but on the other hand it can become a total mess when it comes to keeping the data in your table and the other ones in sync. You essentially have two sources of truth, without a good way to invalidate this "cache" when updates and deletes happen.
Read more about "normalization", and read more about "EXPLAIN" in MySQL -- it will tell you why the other queries are slow, and you might get away with a few proper indexes and foreign keys instead of copying the data.
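For example (a made-up query), if EXPLAIN shows a join type of ALL with a large rows estimate, the query is doing a full table scan, which an index on the filtered column would usually fix:

    EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
    -- if type is ALL here, consider: ALTER TABLE orders ADD INDEX (customer_id);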

Should one steer clear of adding yet another field to a larger MySQL table?

I have a MySQL InnoDB table with 350,000+ rows, containing a couple of things like id, otherId, shortTitle and so on. Now I need a Bool/Bit field for perhaps a couple hundred or thousand of those rows. Should I just add that bool field to the table, or should I create a new table referencing the IDs of the old table -- thereby not risking performance issues in all the existing functions that access the first table?
(Side info: I never use "SELECT * ...". The main table gets lots of reads and rarely writes.)
Adding a field can indeed hamper performance a little, since your table rows grow larger, but it's hardly a problem for a BIT field.
Most probably, you will have exactly the same row count per page, which means no performance decrease at all.
On the other hand, using an extra JOIN to access the row value in another table will be much slower.
I'd add the column right into the table.
What does the new column denote?
From the data modelling perspective, if the column belongs with the data under whichever normal form is in use, then put it with the data; performance impact be damned. If the column doesn't directly belong to the table, then put it in a second table with a foreign key.
Realistically, the performance impact of adding a new column to a table with ~350,000 rows isn't going to be particularly huge. Have you tried issuing the ALTER TABLE statement against a copy, perhaps on a local workstation?
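A simple way to time it safely (placeholder names):

    CREATE TABLE mytable_copy LIKE mytable;          -- same schema and indexes
    INSERT INTO mytable_copy SELECT * FROM mytable;  -- copy the 350K rows
    ALTER TABLE mytable_copy ADD COLUMN is_flagged BIT(1) NULL;  -- time this one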
I don't know why people insist on calling 350K-row tables big. In the mainframe world, that's how big the DBMS configuration tables are :-).
That said, you should be designing your tables in third normal form. If, and only if, you have performance problems should you consider denormalizing.
If you have a column that will apply only to certain of the rows, it's (probably) not going to be 3NF to put it in the same table. You should have a separate table with a foreign key into your 'primary' table.
Keep in mind that's only if the boolean field actually doesn't apply to some of the rows. That's a different situation from the field applying to all rows but not being known for some. In that case, a nullable column in the primary table would be better. But that doesn't sound like what you're describing.
Requiring a bit field only for new entries sounds like you want to implement inheritance. If that is the case, I would add it to a new table to keep things readable. Otherwise, it doesn't matter whether you add it to the main table or not, unless your queries are not using indexes, in which case I would fix that first before making any other decisions regarding performance.