Sparse column vs. Indirection - MySQL

As I understand it, MySQL doesn't have any special SPARSE COLUMN directive, and any question like this is purely situational, so I'm wondering if there is a good rule of thumb for when to use a sparse column vs. when to create another table.
As a specific example, I have a table called Lessons. We want to add a lesProgramNumber column, but it will only apply to about 10% of all lessons at any given time (it will be NULL for the other 90%). We could easily avoid a lot of NULL data by having a separate LessonsProgramNumber table, but that requires an additional JOIN at times. Is there an easy way to make a choice about what I need? What if Lessons only has 500 rows? What if it has 500 million?

InnoDB's COMPACT row format (the default) stores no data for a NULL column; a NULL costs only one bit in the record header's NULL bitmap, and the column is simply skipped in the on-disk row storage. So the cost of sparse columns is not so bad.
See http://dev.mysql.com/doc/refman/5.1/en/innodb-physical-record.html
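For illustration, the two alternatives might look like this. This is only a sketch; the question doesn't give the schema, so the lesId primary key and the INT type are assumptions:

-- Option 1: a nullable (sparse) column directly on Lessons
ALTER TABLE Lessons ADD COLUMN lesProgramNumber INT NULL;

-- Option 2: indirection via a child table that holds rows
-- only for the ~10% of lessons that need one
CREATE TABLE LessonsProgramNumber (
  lesId INT NOT NULL PRIMARY KEY,
  lesProgramNumber INT NOT NULL,
  FOREIGN KEY (lesId) REFERENCES Lessons (lesId)
);

-- With option 2, reads that need the number pay an extra join:
SELECT l.*, lpn.lesProgramNumber
FROM Lessons l
LEFT JOIN LessonsProgramNumber lpn ON lpn.lesId = l.lesId;

With COMPACT rows storing NULLs essentially for free, option 1 usually wins on simplicity; option 2 mainly pays off when the extra column is wide or is updated independently of the rest of the row.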

Related

When to add a column vs adding a related table?

I have a big table with over 100 million rows. I have been trimming it down for months, getting rid of bad data (row-wise), trying to keep it small. I already have 9 columns on this table. I want to add a new boolean column to it.
This table started off small, and now it's getting pretty wide. Yet again, I am tasked with adding more information per row. This time it's a new boolean field. I expect this field to be low volume, meaning fewer than 10% of rows will have it set to true. I know I can make it default NULL, and it is a boolean column, which should be small.
However, I wanted to get some advice. This table cannot get infinitely wide, and I will need to work around this. Under these circumstances, does it make more sense to create another table and foreign-key reference the record when I have additional data to add? How do the pros handle this in database design?
The best situation for usability is to have all data on the record, so any query can read or compute from the table itself without joins. I just do not have confidence that it will scale to 1 BILLION rows (insert meme).
At my job I support MySQL instances that have multi-billion-row tables. At that scale, care must be taken to optimize queries properly; you don't want to do a table scan.
But that's about rows, not columns. You asked first about columns.
The way to choose between adding a column versus adding another table is to follow rules of database normalization. If the new column is for an attribute of the same entity as your current table, add the column to that table. If it's a multi-valued attribute or if it's really an attribute of some other entity, then add it to a different table.
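As a sketch of that rule (with hypothetical names, not taken from the question): a same-entity attribute becomes a column, while a multi-valued attribute becomes a dependent table:

-- Attribute of the same entity: just add the column
ALTER TABLE orders ADD COLUMN is_gift TINYINT(1) NOT NULL DEFAULT 0;

-- Multi-valued attribute: a dependent table, one row per value
CREATE TABLE order_tags (
  order_id BIGINT NOT NULL,
  tag VARCHAR(50) NOT NULL,
  PRIMARY KEY (order_id, tag),
  FOREIGN KEY (order_id) REFERENCES orders (order_id)
);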
Very, very rarely is it the right choice to make another table solely for the sake of having too many columns. A given MySQL table can have dozens of columns pretty easily, and hundreds if you're careful.
In theory, there is no limit to the number of columns that might be appropriate to put in the same table with respect to normalization. But there are practical limits due to how a given implementation (e.g. the InnoDB storage engine in MySQL) stores those columns.
See https://www.percona.com/blog/2013/04/08/understanding-the-maximum-number-of-columns-in-a-mysql-table/
So the maximum number of columns for a table in MySQL is somewhere between 191 and 2829, depending on a number of factors.
In the comments on that blog, I was able to design a table that failed to be created with only 59 columns. Read the blog for details.

MySQL multiple bit columns or 1 enum column

I'm designing DB tables for a log system. I have two ideas in mind for one field: should I create three bit(1) columns or one enum column?
is_error bit(1)
is_test bit(1)
is_embedded bit(1)
or
boolErrors enum('is_error_true', 'is_error_false', 'is_test_true', 'is_test_false', 'is_embedded_true', 'is_embedded_false')
Obviously, holding an enum seems improper both semantically and space-wise, but what about performance? Does fetch time increase when I have 3 columns instead of 1?
If, as it seems, the flags represent states (that is, only one flag may be true at a given point in time), then I would recommend a single column with an integer datatype. Instead of using ENUM, you can use a reference table to store all possible flags and their names, and reference it from the original table using the integer column.
On the other hand, if several flags may be on (say, both is_error and is_test), then a single column is not sufficient. You can either create several columns (if the list of flags never changes), or use a bridge table to store each status on a separate row.
If only one of those flags can be set at a time, use ENUM.
If multiple flags can be set at the same time, use SET.
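For illustration, the two shapes might look like this (a sketch; the logs table and member names are made up):

-- Mutually exclusive states: ENUM (1 or 2 bytes)
ALTER TABLE logs ADD COLUMN state ENUM('error', 'test', 'embedded') NOT NULL;

-- Independent flags that can combine: SET (up to 64 members in at most 8 bytes)
ALTER TABLE logs ADD COLUMN flags SET('error', 'test', 'embedded') NOT NULL DEFAULT '';

-- Testing one member of a SET:
SELECT * FROM logs WHERE FIND_IN_SET('error', flags);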
Performance is not really something to worry about. The main "cost" in working with a row in a table is fetching the row, not the details of what goes on in the columns.
Sure, "smaller is better" for several reasons -- I/O, etc. But an ENUM is 1 or 2 bytes; a SET is up to 8 bytes (for up to 64 flags). Both of those are reasonably small for any use case.
As for speed and indexability, let's see the main queries.

Long many-to-many database table: best performance practice

I have a question about the performance of my MySQL database design.
Table A has a lot of records, say a million, and table B also has a million. There is another table C in which every record ID of A is connected to every row in B, and this connection has an additional value of 1 or 0. So, functionally speaking, every record in A has a boolean vector, where B contains the 'variables' of the vector and 1 or 0 is the value.
Table C will have a lot of write and read actions (select all values for a record of A), so the table is very actively used. And table C is really long: a million times a million rows.
My first question is: will the length of the table give a performance issue? The database needs to be really fast.
My second question is, if this is badly designed, whether there is a better design to achieve what I want. For instance, I can think of storing the entire B vector of each A record inside each row of A. Then table C would not be necessary. But that would make selecting, reading, and writing much more difficult.
The table design is fine and shouldn't be a problem, because you access records via IDs, which should be indexed. Depending on your typical queries, you should also consider adding composite indexes: c(a_id, b_id), c(a_id, value), c(b_id, value), c(a_id, b_id, value).
However, as there exist only two states, 0 and 1, you may decide to store only one of them. That is, if you store only the state-1 records, all pairs not in the table implicitly have state 0. This pays off especially when the states are unevenly distributed (say, 90% of the records have state 0 and only 10% have state 1) or when you usually access only one of the states (e.g. you always look for 1s).
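A minimal sketch of that idea, using the a_id/b_id names from the suggested indexes above and assumed integer keys:

-- Store only the pairs whose state is 1; absence means state 0
CREATE TABLE c (
  a_id INT NOT NULL,
  b_id INT NOT NULL,
  PRIMARY KEY (a_id, b_id),
  KEY (b_id)
);

-- "Which Bs have state 1 for this A?"
SELECT b_id FROM c WHERE a_id = 42;

-- Flipping a pair to 1 is an insert; flipping it back to 0 is a delete
INSERT IGNORE INTO c (a_id, b_id) VALUES (42, 7);
DELETE FROM c WHERE a_id = 42 AND b_id = 7;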
Answer to your first question
Millions of records in a table with many reads and writes won't be a bottleneck if you are following MySQL best practices:
Your engine should be InnoDB.
Your SELECT queries should not involve a full table scan.
Your table should have the desired indexes.
Answer to your second question
You should look at all your possible use cases, because either way is a good idea if a use case supports it. If you split your data across multiple tables, then a join operation has to be performed when needed.

How to design a database where the main entity table has 25+ columns but a single entity's columns get <20% filled on average?

The entities to be stored have 25+ properties (table columns). The entities are pretty diverse, meaning that most of the columns are empty. On average, I'd say, fewer than 20% (<5) of the properties have a value in any particular item. So I have a lot of redundant empty columns in most of the table rows. Almost all of the columns are decimal numbers.
Given this scenario, would you suggest serializing the columns instead, or perhaps creating another table named "Property", which would contain all the possible properties, and then yet another table "EntityProperty", which would map a property to an entity using foreign keys? Or would you leave it as it is?
An example scenario where this kind of redundancy might occur could be the following:
We have an imaginary universe with lots of planets. We are creating a space mining game, and each planet can contain up to 30 different minerals. Most of the planets have only 2-3 minerals.
The simplest solution would be to create a single table 'Planets' with 30 columns, one for each mineral. The problem here is that in most rows of the 'Planets' table, 25+ of those columns hold null or zero, so we store a lot of empty data. Say we have 500k-1M records. I would guess it costs a byte at most to store a null or zero decimal value, so at roughly 25 empty columns per row we waste about 25 bytes per row, i.e., somewhere around 12-25 megabytes in total.
The other solution would be to create two additional tables. Instead of storing all the minerals in the 'Planets' table, we take them out and create a table for the minerals called 'Minerals'. This would contain only 30 rows, one for each distinct mineral type. Then we create a table called 'PlanetMineral', which contains a reference to a planet row and to a mineral row, plus a column telling the amount of the mineral the planet has. Admittedly, in many database systems this complicates queries, since you may have to do several joins. I'm using SQL Server with LINQ to SQL, which scaffolds the foreign-key constraint into a class object property accessible through code (i.e., I can simply access the minerals a planet has with planet.Minerals), so from this perspective it doesn't add complexity. The wasted space is a small fraction (like 1/15) of the first solution; the reason there is still some overhead is the foreign keys we need to store.
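A sketch of that second option (the DDL is mine; key names and types are assumptions the question doesn't spell out):

CREATE TABLE Minerals (
  mineralId INT NOT NULL PRIMARY KEY,
  name VARCHAR(50) NOT NULL
);

CREATE TABLE PlanetMineral (
  planetId INT NOT NULL,
  mineralId INT NOT NULL,
  amount DECIMAL(9,2) NOT NULL,  -- a row exists only when the planet actually has the mineral
  PRIMARY KEY (planetId, mineralId),
  FOREIGN KEY (planetId) REFERENCES Planets (planetId),
  FOREIGN KEY (mineralId) REFERENCES Minerals (mineralId)
);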
As for the data query efficiency, I really don't know how the costs of the queries would compare between these two solutions.
It depends:
How many entities (rows) are you planning to have?
What kind of queries do you run against that table?
Will there be a lot of new properties in the future?
How are you planning to use the properties?
You seem to be concerned about wasting space with the simple table. Try to calculate whether the space savings of the other approaches are really significant and worthwhile. Disk is (usually) cheap.
If you have a low number of rows, then the single table is probably better (it is easier to implement).
If you plan to create complex queries against the properties (e.g. WHERE property1 < 123), then the simple table is probably easier.
If you are planning to add a lot of new properties in the future, then the Property/EntityProperty approach could be useful.
I'd go with the simple one-table approach, because you have a rather small number of rows (<1M), you are probably running your database on server machines and not some handheld/mobile thing (it's SQL Server), and your database schema is rather rigid.
For numbers, I would personally leave it as is, in one table. Numbers are compressed into a few bytes, and the overhead of having an EntityProperty table would far outweigh that. Serializing is an option, but it means you cannot use SQL to search or compute on the properties; you have to fetch the data, deserialize it, and then compute.

Maximum number of columns in a table

Problem 1: What is the maximum number of columns we can have in a table?
Problem 2: What is the maximum number of columns we should have in a table?
Answer 1: Probably more than you have, but not more than you will grow to have.
Answer 2: Fewer than you have.
Asking these questions usually indicates that you haven't designed the table well. You are probably practicing the Metadata Tribbles antipattern: the columns tend to accumulate over time, creating an unbounded set of columns that store basically the same type of data, e.g. subtotal1, subtotal2, subtotal3, etc.
Instead, I'm guessing you should create an additional dependent table, so your many columns become many rows. This is part of designing a proper normalized database.
CREATE TABLE Subtotals (
  entity_id    INT NOT NULL,
  year_quarter SMALLINT NOT NULL,  -- year * 10 + quarter, e.g. 20094 = Q4 2009
  subtotal     NUMERIC(9,2) NOT NULL,
  PRIMARY KEY (entity_id, year_quarter),
  FOREIGN KEY (entity_id) REFERENCES Entities (entity_id)
);
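With this shape, what would have become new subtotalN columns are just rows, and aggregation is a plain query, for example:

-- Total per entity across all quarters
SELECT entity_id, SUM(subtotal) AS total
FROM Subtotals
GROUP BY entity_id;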
My former colleague also wrote a blog about this:
Understanding the maximum number of columns in a MySQL table
The answer is not as straightforward as you might think.
SQL 2000: 1024
SQL 2005: 1024
SQL 2008: 1024 for a non-wide table, 30,000 for a wide table.
Wide tables are for when you have used the sparse column feature introduced in SQL 2008, which is designed for when you have a large number of columns that are normally empty.
Just because these limits are available does not mean you should be using them, however. I would start by designing the tables based on the requirements and then check whether vertical partitioning of one table into two smaller tables is required, etc.
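For reference, declaring a sparse column in T-SQL looks like this (a sketch with made-up table and column names):

-- SQL Server 2008+: SPARSE optimizes storage for mostly-NULL columns
CREATE TABLE dbo.Measurements (
  id         INT NOT NULL PRIMARY KEY,
  rare_value DECIMAL(9,2) SPARSE NULL
);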
1)
http://msdn.microsoft.com/en-us/library/aa933149%28SQL.80%29.aspx
1024 seems to be the limit.
2)
Much less than 1024 :). Seriously, it depends on how normalized you want your DB to be. Generally, the fewer columns you have in a table, the easier it will be for someone to understand. For a person table, for example, you might want to store the person's address in another table (person_address, say). It's best to break your data up into entities that make sense for your business model and go from there.
2) There are plenty of guidelines out there, in particular regarding database normalization. The overarching principle is always to be able to adapt. As with classes, tables with a large number of columns are not very flexible. Some of the questions you should ask yourself:
Does Column A describe an attribute of the object (table) that could/should be grouped with Column B?
Data updates. Keep in mind that most RDBMSs perform row-level locking when updating values. This means that if you are constantly updating Column A while another process is updating Column B, they will fight over the row, and this will create contention.
Database design is an art more than a science. While guidelines and technical limitations will get you in the right direction, there are no hard rules that will make your system work or fail 100%.
I think it's 4096 in MySQL; for SQL Server I don't know.
I asked the same question a few months ago for a special scenario; maybe the answers will help you decide. Usually, as few as possible, I would say.