Strategies for large databases with changing schemas - mysql

We have a MySQL database table with hundreds of millions of rows. We run into issues performing any kind of operation on it. For example, adding columns has become impossible to do within any kind of predictable time frame. When we want to roll out a new column, the ALTER TABLE command takes forever, so we don't have a good idea of what the maintenance window will be.
We're not tied to keeping this data in MySQL, but I was wondering if there are strategies, for MySQL or for databases in general, for updating schemas on large tables.
One idea, which I don't particularly like, would be to create a new table with the old schema plus the additional column, and run queries against a view that unions the results until all the data could be moved to the new table's schema.
Right now we already run into issues where deleting large numbers of records based on a WHERE clause fails with an error.
Ideas?

In MySQL, you can create a new table using an entity-attribute-value (EAV) model. This would have one row per entity/attribute pair, rather than putting each attribute in a new column.
This is particularly useful for sparse data. Words of caution: types are problematic (everything tends to get turned into strings) and you cannot define foreign key relationships on the values.
EAV models shine when you have attributes that apply to only a small number of rows, so they may prove useful in your case.
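A minimal sketch of such an EAV table, with hypothetical names:

    CREATE TABLE entity_attribute (
        entity_id  BIGINT       NOT NULL,
        attribute  VARCHAR(64)  NOT NULL,
        value      VARCHAR(255),             -- everything tends to end up as a string
        PRIMARY KEY (entity_id, attribute)
    );

    -- one row per (entity, attribute); attributes an entity lacks take no space at all
    INSERT INTO entity_attribute (entity_id, attribute, value)
    VALUES (42, 'color', 'red'), (42, 'weight_kg', '3.5');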
In NOSQL data models, adding new attributes or lists of attributes is simpler. However, there is no relationship to the attributes in other rows.

Columnar storage (at least the ColumnStore engine in MariaDB) is very frugal on space -- some say 10x smaller than InnoDB. The shrinkage alone may be well worth it for 100M rows.
You have not explained whether your data is sparse. If it is, JSON is not that costly for space -- you completely leave out any 'fields' that are missing, so they take zero space. With almost any other approach, there is at least some overhead for missing cells.
As you suggest, use regular columns for the common fields, and also for the main fields that you are likely to search on. Then throw the rest into JSON.
I like to compress the JSON string (in the client) and store it in a BLOB. That gives about 3x shrinkage over uncompressed TEXT.
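A rough sketch of that hybrid layout, with hypothetical names -- searchable fields stay as real, indexed columns, and the sparse remainder goes into a JSON (or client-compressed BLOB) column:

    CREATE TABLE item (
        id         BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
        category   VARCHAR(32) NOT NULL,
        created_at DATETIME NOT NULL,
        extra      JSON,          -- or a BLOB holding the client-compressed JSON string
        INDEX idx_cat_created (category, created_at)
    );

    -- missing 'fields' are simply absent from the JSON document
    INSERT INTO item (category, created_at, extra)
    VALUES ('widget', NOW(), '{"color": "red", "weight_kg": 3.5}');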
I dislike the one-row per attribute EAV approach; it is very costly in space, JOINs, etc, etc.
[More thoughts] on EAV.
Do avoid ALTER whenever possible.

Related

Optional fields on very large db

I am currently working on the redesign of a MySQL database.
This database is actually one huge table of 200 fields, several million rows, and almost no indexes... In short: disastrous performance and huge RAM consumption!
I first managed to reduce the number of fields by setting up 1:n relationships.
I have a question on a specific point:
A number of fields in this database are optional and rarely filled in (sometimes almost never).
What is the best thing to do in this case?
leave the field in the table even if its value is very often NULL
set up an n:n relationship, knowing that these relationships, if they exist, will only ever return one row
...or another solution I haven't thought of
Thank you in advance for your wise advice;)
Dimitri
My suggestion:
First of all, make sure to normalize your DB, at least to Third Normal Form. This will probably eliminate some of your original columns and split the rest over several tables.
Once that is done, in case you still have lots of 'optional and rarely filled' columns in some of your tables, consider how to proceed depending on your needs: What is most important to you? Disk space or Performance?
Take a look at the MySQL Optimizing Data Size documentation for extra tips/improvements for the redesign of your database, depending on your needs.
Regarding 'set up an n:n relationship...' (I figure you meant a 1:1 relationship): this can be an interesting option in some cases:
In some circumstances, it can be beneficial to split into two a table that is scanned very often. This is especially true if it is a dynamic-format table and it is possible to use a smaller static format table that can be used to find the relevant rows when scanning the table (extracted from here)
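As a rough, hypothetical illustration of that kind of split (names invented), the rarely-filled fields can live in a 1:1 detail table that only has rows for the entities that actually use them:

    CREATE TABLE orders (
        id          BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
        customer_id INT NOT NULL,
        created_at  DATETIME NOT NULL
    );

    CREATE TABLE order_detail (
        order_id   BIGINT NOT NULL PRIMARY KEY,   -- same key as orders.id, no AUTO_INCREMENT
        gift_note  TEXT,
        promo_code VARCHAR(32),
        FOREIGN KEY (order_id) REFERENCES orders(id)
    );

    -- plain orders simply have no detail row; the LEFT JOIN returns NULLs for them
    SELECT o.*, d.gift_note, d.promo_code
    FROM orders o
    LEFT JOIN order_detail d ON d.order_id = o.id;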

Use many tables with few columns or few tables with many columns in MySQL? [duplicate]

I'm setting up a table that might have upwards of 70 columns. I'm now thinking about splitting it up as some of the data in the columns won't be needed every time the table is accessed. Then again, if I do this I'm left with having to use joins.
At what point, if any, is it considered too many columns?
It's considered too many once it's above the maximum limit supported by the database.
The fact that you don't need every column to be returned by every query is perfectly normal; that's why the SELECT statement lets you explicitly name the columns you need.
As a general rule, your table structure should reflect your domain model; if you really do have 70 (100, what have you) attributes that belong to the same entity, there's no reason to separate them into multiple tables.
There are some benefits to splitting up the table into several with fewer columns, which is also called Vertical Partitioning. Here are a few:
If you have tables with many rows, modifying the indexes can take a very long time, as MySQL needs to rebuild all of the indexes in the table. Having the indexes split over several tables could make that faster.
Depending on your queries and column types, MySQL could be writing temporary tables (used in more complex SELECT queries) to disk. This is bad, as disk I/O can be a big bottleneck, and it happens whenever the query involves TEXT or BLOB columns (see the sketch after this list).
Wider tables can lead to slower query performance.
Don't prematurely optimize, but in some cases, you can get improvements from narrower tables.
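For the TEXT/BLOB point above, a hedged sketch (hypothetical names) of what such a vertical split might look like: keep the hot, narrow columns in the main table and move the large columns into a 1:1 side table so routine queries never touch them:

    CREATE TABLE article (
        id        BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
        title     VARCHAR(200) NOT NULL,
        author_id INT NOT NULL,
        published DATETIME NOT NULL
    );

    CREATE TABLE article_body (
        article_id BIGINT NOT NULL PRIMARY KEY,
        body       MEDIUMTEXT NOT NULL,
        FOREIGN KEY (article_id) REFERENCES article(id)
    );

    -- listings scan only the narrow table; the body is joined in only when actually needed
    SELECT a.id, a.title, a.published
    FROM article a
    ORDER BY a.published DESC
    LIMIT 20;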
It is too many when it violates the rules of normalization. It is pretty hard to get that many columns if you are normalizing your database. Design your database to model the problem, not around any artificial rules or ideas about optimizing for a specific db platform.
Apply the following rules to the wide table and you will likely have far fewer columns in a single table.
No repeating elements or groups of elements
No partial dependencies on a concatenated key
No dependencies on non-key attributes
Here is a link to help you along.
That's not a problem as long as all the attributes belong to the same entity and do not depend on each other.
To make life easier, you can have one text column with a JSON document stored in it -- provided you don't have a problem with fetching all the attributes every time. However, this would largely defeat the purpose of storing the data in an RDBMS and would complicate every database transaction, so it's not a recommended approach to follow throughout the database.
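A small sketch of that idea using MySQL's native JSON type instead of plain TEXT (hypothetical names); note how even a simple filter has to go through JSON path expressions, which is part of why it is hard to recommend broadly:

    CREATE TABLE entity (
        id    BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
        attrs JSON
    );

    -- reading or filtering on an attribute requires JSON functions rather than plain columns
    SELECT id, attrs->>'$.color' AS color
    FROM entity
    WHERE attrs->>'$.status' = 'active';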
Having too many columns in the same table can cause huge problems in replication as well. You should know that the changes that happen on the master will replicate to the slave; for example, if you update one field in the table, the whole row may be written to the binary log and replicated, so wide rows make replication traffic heavier.

Alternatives to mysql for large reference tables

We currently use mysql for two types of tables:
The first set are the typical transaction based tables.
The second are tables that store historical data, which is usually written once and read many times. They are large, hundreds of millions of rows or larger, and have a couple of indexes.
We have a couple of issues with these tables.
Any schema changes take forever
We’re not comfortable with the whole table being a single point of failure. If anything goes wrong, rebuilding this table would take ages.
It doesn't seem scalable
Are there any features of mysql we are missing that would alleviate these issues? I saw that MariaDB now has a way to add columns that doesn’t lock the whole table, but it doesn’t solve the other issues.
We’re also open to other products that might solve the issue. Any ideas?
Why would you ever need to add columns to historical data? Anyway, what values would you assign to the 'old' rows?
An alternative to adding a column is to create a "parallel" table (aka "vertical partitioning"). The new table would have the same PRIMARY KEY as the original (except for any AUTO_INCREMENT declaration). You would use a LEFT JOIN to fetch columns from both tables, and understand that 'old' rows would give you NULLs for the 'new' columns.
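A sketch of that parallel-table idea, with made-up names:

    -- history is the existing big table; history_extra holds only the new columns
    CREATE TABLE history_extra (
        id       BIGINT NOT NULL PRIMARY KEY,   -- same PK values as history.id, no AUTO_INCREMENT
        new_col1 INT,
        new_col2 VARCHAR(100)
    );

    -- 'old' rows have no matching row here, so the new columns come back as NULL
    SELECT h.*, x.new_col1, x.new_col2
    FROM history h
    LEFT JOIN history_extra x ON x.id = h.id;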
Another useful thing to do for historical data is to treat it like a Fact table in Data Warehousing. Then build and maintain "Summary table(s)" to significantly speed up common "report" type queries.
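For example (hypothetical column names), a daily summary table that is appended to once a day and then used for the report queries:

    CREATE TABLE daily_summary (
        day          DATE NOT NULL,
        customer_id  INT NOT NULL,
        row_count    INT NOT NULL,
        total_amount DECIMAL(14,2) NOT NULL,
        PRIMARY KEY (day, customer_id)
    );

    -- summarize yesterday's slice of the big history table
    INSERT INTO daily_summary (day, customer_id, row_count, total_amount)
    SELECT DATE(created_at), customer_id, COUNT(*), SUM(amount)
    FROM history
    WHERE created_at >= CURDATE() - INTERVAL 1 DAY
      AND created_at <  CURDATE()
    GROUP BY DATE(created_at), customer_id;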
In newer versions of MySQL/MariaDB, ALTER TABLE ... ADD COLUMN ... ALGORITHM=INPLACE removes most of the performance pain.
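For instance (table and column names assumed), the online-DDL form looks like this; MySQL 8.0 and MariaDB 10.3+ also offer ALGORITHM=INSTANT for plain column additions:

    -- allows concurrent reads and writes while the column is added
    ALTER TABLE history
        ADD COLUMN source_id INT NULL,
        ALGORITHM=INPLACE, LOCK=NONE;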
Adding columns is also solved by moving toward an EAV schema, which has a lot of bad qualities. So, move only part-way toward it. That is, keep the 5-10 main columns that you use for filtering and sorting as real columns, then put the rest of the key-value data into a JSON column. Both MySQL and MariaDB have such a type (though with some differences), plus MariaDB has "Dynamic Columns".
Summary tables
EAV
"but it doesn’t solve the other issues" -- such as??

How to design a database where the main entity table has 25+ columns but a single entity's columns get <20% filled on average?

The entities to be stored have 25+ properties (table columns). The entities are pretty diverse, meaning that, most of the columns are empty. On average, I'd say, less than 20% (<5) properties have a value in any particular item. So, I have a lot of redundant empty columns for most of the table rows. Almost all of the columns are decimal numbers.
Given this scenario, would you suggest serializing the columns instead, or perhaps, create another table named "Property", which would contain all the possible properties and then creating yet another table "EntityProperty" which would map an property to an entity using foreign keys? Or would you leave it as it is?
An example scenario where this kind of redundancy might occur could be the following:
We have an imaginary universe with lots of planets. We are creating a space mining game and each planet has 30 different mineral contents. Most of the planets have only 2-3 minerals.
The simplest solution would be to create a single table 'Planets' with 30 columns, one for each mineral. The problem here is that in most rows of the 'Planets' table, 25+ of those columns would be null or zero. Thus we have a lot of redundant data. Say we would have 500k-1M records. I would guess it costs a byte at most to store a null or zero decimal value. Thus, we waste 500,000-1,000,000 bytes of space, i.e. one megabyte at most.
The other solution would be to create two additional tables. Instead of storing all the minerals in the 'Planets' table, we take them all out and create a table for the minerals called 'Minerals'. This would contain only 30 rows, one for each different mineral type. Then, we create a table called 'PlanetMineral' which contains a reference to a planet row and to a mineral row, and additionally this table would have a column telling the amount of the mineral the planet has. Apparently, in many database systems this complicates queries since you may have to do several joins. I'm using SQL Server with LINQ to SQL, which scaffolds the foreign key constraint into a class object property accessible through code (i.e. I can simply access the minerals a planet has with planet.Minerals), so from this perspective it doesn't add complexity. The redundancy is a small portion (like 1/15) of the first solution. The reason there is still some overhead is because of the foreign keys we need to store.
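A sketch of the two extra tables just described (MySQL-flavored DDL; with SQL Server you would use IDENTITY instead of AUTO_INCREMENT):

    CREATE TABLE Planets (
        PlanetId INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
        Name     VARCHAR(100) NOT NULL
    );

    CREATE TABLE Minerals (
        MineralId INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
        Name      VARCHAR(100) NOT NULL
    );

    -- one row per (planet, mineral) pair that actually exists; absent minerals take no space
    CREATE TABLE PlanetMineral (
        PlanetId  INT NOT NULL,
        MineralId INT NOT NULL,
        Amount    DECIMAL(18,4) NOT NULL,
        PRIMARY KEY (PlanetId, MineralId),
        FOREIGN KEY (PlanetId)  REFERENCES Planets(PlanetId),
        FOREIGN KEY (MineralId) REFERENCES Minerals(MineralId)
    );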
As for the data query efficiency, I really don't know how the costs of the queries would compare between these two solutions.
It depends:
How many entities (rows) are you planning to have?
What kind of queries will you run against that table?
Will there be a lot of new properties in the future?
How are you planning to use the properties?
You seem to be concerned about wasting space with the simple table. Try to calculate whether the space savings of the other approaches are really significant and worthwhile. Disk is (usually) cheap.
If you have a low number of rows, then the single table is probably better (it is easier to implement).
If you plan to run complex queries against the properties (e.g. WHERE property1 < 123) then the simple table is probably easier.
If you are planning to add a lot of new properties in the future, then the Property/EntityProperty approach could be useful.
I'd go with the simple one-table approach because you have a rather small number of rows (<1M), you are probably running your database on server machines rather than some handheld/mobile device (it's SQL Server), and your database schema is rather rigid.
For numbers, I would personally leave it as is, in one table. Numbers are compressed into a few bytes, and the overhead of having an EntityProperty table would far outweigh that. Serializing is an option, but it means you cannot use SQL to search or compute on the properties; you have to fetch the data, deserialize it, and then compute.

What is the optimal amount of data for a table?

How much data should be in a table so that reading is optimal? Assuming that I have 3 fields varchar(25). This is in MySQL.
I would suggest that you consider the following in optimizing your database design:
Consider what you want to accomplish with the database. Will you be performing a lot of inserts to a single table at very high rates? Or will you be performing reporting and analytical functions with the data?
Once you've determined the purpose of the database, define what data you need to store to perform whatever functions are necessary.
Normalize till it hurts. If you're performing transaction processing (the most common function for a database) then you'll want a highly normalized database structure. If you're performing analytical functions, then you'll want a more denormalized structure that doesn't have to rely on joins to generate report results.
Typically, if you've really normalized the structure till it hurts then you need to take your normalization back a step or two to have a data structure that will be both normalized and functional.
A normalized database is mostly pointless if you fail to use keys. Make certain that each table has a primary key defined. Don't use surrogate keys just because that's what you always see. Consider what natural keys might exist in any given table. Once you are certain that you have the right primary key for each table, you need to define your foreign key references. Establishing explicit foreign key relationships rather than relying on implicit definitions will give you a performance boost, provide integrity for your data, and self-document the database structure.
Look for other indexes that exist within your tables. Do you have a column or set of columns that you will search against frequently like a username and password field? Indexes can be on a single column or multiple columns so think about how you'll be querying for data and create indexes as necessary for values you'll query against.
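A small illustrative sketch (hypothetical tables) tying the key and index advice together:

    CREATE TABLE account (
        account_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
        username   VARCHAR(25) NOT NULL,
        email      VARCHAR(100) NOT NULL,
        UNIQUE KEY uq_username (username)               -- supports the frequent lookup by username
    );

    CREATE TABLE login_event (
        event_id   BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
        account_id INT NOT NULL,
        logged_at  DATETIME NOT NULL,
        FOREIGN KEY (account_id) REFERENCES account(account_id),
        INDEX idx_account_time (account_id, logged_at)  -- composite index for "logins per account over time"
    );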
Number of rows should not matter. Make sure the fields you're searching on are indexed properly. If you only have 3 varchar(25) fields, then you probably need to add a primary key that is not a varchar.
Agree that you should ensure that your data is properly indexed.
Apart from that, if you are worried about table size, you can always implement some type of data archival strategy to later down the line.
Don't worry too much about this until you see problems cropping up, and don't optimise prematurely.
For optimal reading you should have an index. A table exists to hold the rows it was designed to contain. As the number of rows increases, the value of the index comes into play and reading remains brisk.
Phrased as such, I don't know how to answer this question. An indexed table of 100,000 records is faster than an unindexed table of 1,000.
What are your requirements? How much data do you have? Once you know the answer to these questions you can make decisions about indexing and/or partitioning.
This is a very loose question, so a very loose answer :-)
In general if you do the basics - reasonable normalization, a sensible primary key and run-of-the-mill queries - then on today's hardware you'll get away with most things on a small to medium sized database - i.e. one with the largest table having less than 50,000 records.
However once you get past 50k - 100k rows, which roughly corresponds to the point when the RDBMS is likely to become memory constrained, then unless you have your access paths set up correctly (i.e. indexes), performance will start to fall off catastrophically. That is in the mathematical sense - in such scenarios it's not unusual to see performance deteriorate by an order of magnitude or two for a doubling in table size.
Obviously therefore the critical table size at which you need to pay attention will vary depending upon row size, machine memory, activity and other environmental issues, so there is no single answer, but it is well to be aware that performance generally does not degrade gracefully with table size and plan accordingly.
I have to disagree with Cruachan about "50k - 100k rows .... roughly correspond(ing) to the point when the rdbms is likely to be memory constrained". This blanket statement is just misleading without two additional data points: the approximate size of the row and the available memory. I'm currently developing a database to find the longest common subsequence (a la bio-informatics) of lines within source code files, and reached millions of rows in one table, even with a VARCHAR field of close to 1000, before it became memory constrained. So, with proper indexing and sufficient RAM (a gig or two), as regards the original question, with rows of 75 bytes at most there is no reason why the proposed table couldn't hold tens of millions of records.
The proper amount of data is a function of your application, not of the database. There are very few cases where a MySQL problem is solved by breaking a table into multiple subtables, if that's the intent of your question.
If you have a particular situation where queries are slow, it would probably be more useful to discuss how to improve that situation by modifying the query or the table design.