When to add a column vs adding a related table? - mysql

I have a big table with over 100 million rows. I have been trimming it down for months, getting rid of bad data (row-wise) and trying to keep it small. I already had 9 columns on this table. I want to add a new boolean column to it. Below is the current state.
This table started off small, and now it's getting pretty wide. Yet again, I am tasked with adding more information per row. This time it's a new boolean field. I expect this field to be low volume, meaning less than 10% will have this set to true. I know I can make it default null, and it is a boolean column which should be small.
However, I wanted to get some advice. This table cannot get infinitely wide, and I will need to work around this. Under these circumstances, does it make more sense to create another table and foreign key reference the record when I have additional data to add? How do the pros handle this in database design?
The best situation for usability is to have all data on the record so any form of a query can get or calculate on the table itself without joins. I just do not have confidence that it will scale to 1 BILLION rows (insert meme).

At my job I support MySQL instances that have multi-billion row tables. At that scale, care must be taken to optimize queries properly. You don't want to do a table-scan at that scale.
But that's about rows, not columns. You asked first about columns.
The way to choose between adding a column versus adding another table is to follow rules of database normalization. If the new column is for an attribute of the same entity as your current table, add the column to that table. If it's a multi-valued attribute or if it's really an attribute of some other entity, then add it to a different table.
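A minimal sketch of that rule, using Python's built-in sqlite3 module as a runnable stand-in for MySQL (the DDL is similar but not identical). The table and column names (item, is_flagged, item_tag) are hypothetical and do not come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany("INSERT INTO item (id, name) VALUES (?, ?)", [(1, "a"), (2, "b")])

# Single-valued attribute of the same entity: add a column to the existing table.
conn.execute("ALTER TABLE item ADD COLUMN is_flagged INTEGER NOT NULL DEFAULT 0")
conn.execute("UPDATE item SET is_flagged = 1 WHERE id = 2")

# Multi-valued attribute (an item can have many tags): use a separate table
# keyed by the parent's id.
conn.execute("""
    CREATE TABLE item_tag (
        item_id INTEGER NOT NULL REFERENCES item(id),
        tag     TEXT    NOT NULL,
        PRIMARY KEY (item_id, tag)
    )
""")
conn.executemany("INSERT INTO item_tag VALUES (?, ?)", [(1, "red"), (1, "large")])

flagged = conn.execute("SELECT id FROM item WHERE is_flagged = 1").fetchall()
tags = conn.execute(
    "SELECT tag FROM item_tag WHERE item_id = 1 ORDER BY tag"
).fetchall()
print(flagged, tags)
```

The boolean from the question falls into the first case: it is one value per row, so normalization alone gives no reason to move it out.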
Very, very rarely is it the right choice to make another table solely for the sake of having too many columns. A given MySQL table can have dozens of columns pretty easily, and hundreds if you're careful.
In theory, there is no limit to the number of columns that might be appropriate to put in the same table with respect to normalization. But there are limitations due to the code to store those columns in a given implementation (e.g. InnoDB storage engine in MySQL).
See https://www.percona.com/blog/2013/04/08/understanding-the-maximum-number-of-columns-in-a-mysql-table/
So the maximum number of columns for a table in MySQL is somewhere between 191 and 2829, depending on a number of factors.
In the comments on that blog, I was able to design a table that failed to be created at 59 columns. Read the blog for details.

Related

Table with many columns or many small tables?

I created a table where it has 30 columns.
CREATE TABLE "SETTINGS" (
"column1" INTEGER PRIMARY KEY,
...
...
"column30"
)
However, I can group them and create different tables that have foreign keys to the primary table. Which is the best way to follow? Or is the number of columns small enough that it makes no difference which way I go?
It depends on the data and on the queries you run most often.
Best for one big table
If you need to extract all the columns always
If you need to update many fields at the same time
If all or nearly all of the fields have non-null values
Best for many little tables
If the data are "sparse", meaning not many columns have values: you can split them into different tables and create a record in a child table only if non-null values exist
If you extract only few related fields at one time
If you update only related fields at one time
Better names for each column (for example, instead of domicile_address and residence_address in one table, you can have a column named address in each of two tables, residences and domiciles)
The problem is that, in general, either solution can work depending on the situation. A usage analysis must be done to choose the right one.
If they really are named column1, column2....column30 then that's a fairly good indicator that your data is not normalized.
Following the rules of normalization should always be your starting point. There are sometimes reasons for breaking the rules - but that comes after you know what your data should look like.
Regarding breaking the rules.....there are 2 potential scenarios which may apply here (after you've worked out the correct structure and done an appropriate level of testing with an appropriate volume of data):
Joins are expensive. Holding parent/child relations in a single table can improve query performance where you routinely select parent and child together and retrieve individual rows
Unless you are using fixed-width MyISAM tables, updating records can change their size, so they have to be relocated in the table data file. This is expensive and can have a knock-on effect on retrieval.

Database Optimisation through denormalization and smaller rows

Do tables with many columns take more time than tables with fewer columns during a SELECT or UPDATE query? (The row count is the same and I will update/select the same number of columns in both cases.)
example: I have a database to store user details and to store their last active time-stamp. In my website, I only need to show active users and their names.
Say one table named userinfo has the following columns: (id,f_name,l_name,email,mobile,verified_status). Is it a good idea to store the last active time in the same table as well? Or is it better to make a separate table (say, user_active) to store the last activity timestamp?
The reason I am asking: if I make two tables, the userinfo table will only be accessed during new signups (to INSERT a new user row), and I will use the user_active table (the table with fewer columns) to UPDATE the timestamp and SELECT active users frequently.
But the cost I have to pay for creating two tables is data duplication as user_active table columns will be (id, f_name, timestamp).
The answer to your question is that, to a close approximation, having more columns in a table does not really take more time than having fewer columns for accessing a single row. This may seem counter-intuitive, but you need to understand how data is stored in databases.
Rows of a table are stored on data pages. The cost of a query is highly dependent on the number of pages that need to be read and written during the course of the query. Parsing the row from the data page is usually not a significant performance issue.
Now, wider rows do have a very slight performance disadvantage, because more data would (presumably) be returned to the user. This is a very minor consideration for rows that fit on a single page.
On a more complicated query, wider rows have a larger performance disadvantage, because more data pages need to be read and written for a given number of rows. For a single row, though, one page is being read and written -- assuming you have an index to find that row (which seems very likely in this case).
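A rough back-of-the-envelope sketch of the page argument. It assumes InnoDB's default 16 KB page size and ignores page and row overhead; the row widths (100 and 400 bytes) and the row count are made-up numbers for illustration only.

```python
import math

PAGE_BYTES = 16 * 1024  # InnoDB default page size; overhead deliberately ignored


def pages_needed(row_count: int, row_bytes: int) -> int:
    """Approximate number of data pages holding row_count rows of row_bytes each."""
    rows_per_page = max(1, PAGE_BYTES // row_bytes)
    return math.ceil(row_count / rows_per_page)


# A full scan over one million rows touches every data page.
narrow_scan = pages_needed(1_000_000, 100)
wide_scan = pages_needed(1_000_000, 400)
print(narrow_scan, wide_scan)

# An indexed single-row lookup lands on one data page in either layout.
single_row = pages_needed(1, 400)
print(single_row)
```

Under these assumptions the wider rows need roughly four times as many pages for a scan, while the single-row case is unchanged, which is the distinction the answer draws.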
As for the rest of your question: the structure of your second table is not correct. You would not (normally) include f_name in two tables -- that is data redundancy and causes all sorts of other problems. There is a legitimate question whether you should store a table of all activity and use that table for display purposes, but that is not the question you are asking.
Finally, for the data volumes you are talking about, having a few extra columns would make no noticeable difference on any reasonable transaction volume. Use one table if you have one attribute per entity and no compelling reason to do otherwise.
When returning and parsing a single row, the number of columns is unlikely to make a noticeable difference. However, searching and scanning tables with smaller rows is faster than tables with larger rows.
When searching using an index, MySQL walks a B-tree, whose lookup cost grows only logarithmically with the number of rows, so it would require significantly larger rows (and many rows) before any speed penalty is noticeable.
Scanning is a different matter. When scanning, it's reading through all of the data for all of the rows, so there's a 1-to-1 performance penalty for larger rows. Yet, with proper indexes, you shouldn't be doing much scanning.
However, in this case, keep the date together with the user info because they'll be queried together and there's a 1-to-1 relationship, and a table with larger rows is still going to be faster than a join.
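A minimal sketch of the single-table layout, again using Python's sqlite3 as a stand-in for MySQL. The last_active column, the index name, and the helper functions are assumptions for illustration; email, mobile and verified_status from the question are omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE userinfo (
        id          INTEGER PRIMARY KEY,
        f_name      TEXT NOT NULL,
        l_name      TEXT NOT NULL,
        last_active INTEGER  -- Unix timestamp; NULL until the first activity
    )
""")
# An index on the timestamp lets the "active users" query avoid a full scan.
conn.execute("CREATE INDEX idx_userinfo_last_active ON userinfo (last_active)")
conn.executemany(
    "INSERT INTO userinfo (id, f_name, l_name, last_active) VALUES (?, ?, ?, ?)",
    [(1, "Ann", "Lee", 1000), (2, "Bob", "Ray", 5000), (3, "Cy", "Orr", None)],
)


def record_activity(user_id: int, now: int) -> None:
    """Update the timestamp in place; no second table and no duplicated name."""
    conn.execute("UPDATE userinfo SET last_active = ? WHERE id = ?", (now, user_id))


def active_users(since: int) -> list:
    """Return first names of users active at or after `since`, newest first."""
    rows = conn.execute(
        "SELECT f_name FROM userinfo WHERE last_active >= ? "
        "ORDER BY last_active DESC",
        (since,),
    )
    return [row[0] for row in rows]


record_activity(3, 6000)
recent = active_users(4000)
print(recent)
```

The name is read from the same row as the timestamp, so there is nothing to keep in sync and no join to pay for.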
Only denormalize for optimization when performance becomes an actual problem and you can't resolve it any other way (adding an index, improving hardware, etc.).

How many fields is normal to have in one table?

Ok, I am creating a game. I have one table where I save a lot of information about a member, so I have many fields in it. How many fields is normal to have in one table? Does it matter? Maybe I should split that info into two-three-four tables? What do you think?
Normalize the Database
If you feel you have too many columns, you probably have repeating groups, which suggests you should normalize the database. See an example here: Description of the database normalization basics
Hard MySQL Limits
MySQL 5.5 Column Count Limit
Every table has a maximum row size of 65,535 bytes.
There is a hard limit of 4096 columns per table.
Splitting of data into tables should generally not be dictated by the number of columns, but by the nature of the data. The process of splitting a large table into smaller ones is called normalization.
The only other reason I can think of to split a table is, if you may need data in clusters, i.e. you often need columns A-D together or columns E-L, but never all columns or columns D-F, then you can split the table into two tables, one containing columns A-D and the primary key, the other one containing columns E-L and the primary key.
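A sketch of that kind of split, with Python's sqlite3 standing in for MySQL. The table names (member_core, member_extra) and the column names are hypothetical; both tables share the same primary key, and a join recombines them on the occasions when both halves are needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Columns that are usually read together live in one table...
conn.execute("CREATE TABLE member_core (id INTEGER PRIMARY KEY, a TEXT, b TEXT)")
# ...and the other cluster lives in a second table with the same key.
conn.execute("""
    CREATE TABLE member_extra (
        id INTEGER PRIMARY KEY REFERENCES member_core(id),
        e  TEXT,
        f  TEXT
    )
""")
conn.execute("INSERT INTO member_core VALUES (1, 'a1', 'b1')")
conn.execute("INSERT INTO member_extra VALUES (1, 'e1', 'f1')")

# The common case touches only one of the two tables.
core_only = conn.execute("SELECT a, b FROM member_core WHERE id = 1").fetchone()

# The rare case that needs both clusters pays for a join on the shared key.
combined = conn.execute("""
    SELECT c.a, x.e
    FROM member_core AS c
    JOIN member_extra AS x ON x.id = c.id
    WHERE c.id = 1
""").fetchone()
print(core_only, combined)
```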
Speaking about limits, MySQL says it's 4096 (source).
Yet I haven't seen so big tables yet, even those huge data mining tables don't come close.
You shouldn't be concerned about it as long as your database is normalized. As soon as you can spot the same data being stored twice (for example, a player table might have a player_type column storing some fixed values), it's worth moving such data to a separate table instead of duplicating information in other tables, which also reduces the column count (but that's only a "side effect" of normalization, not something that drives it).
I've never personally encountered one with more than 500 columns in it, but short of the maximum sizes there's no reason to have any fewer than the design demands. Beware of doing SELECT * FROM it though.
"information about a member" - umm always difficult, but I always separate identifiable information into another table and use a salt key to link the 2 together. That way it is not as easy to "hijack" usernames and passwords etc. And you can always use the SALT as a session variable rather than username/password/userId or whatever.
Typically I only store a ID, salt and joining date in 1 table. As I said, the rest I try to "hide" so that they cannot be "linked/hijacked".
Hope this helps.

Is it better to break the table up and create new table which will have 3 columns + Foreign ID, or just make it n+3 columns and use no joins?

I am designing a database for a project. I have a table that has 10 columns, most of them are used whenever the table is accessed, and I need to add 3 more columns;
View Count
Thumbs Up (count)
Thumbs Down (Count)
which will be used in 90% of the queries when the table is accessed. So, my question is whether it is better to break the table up and create a new table which will have these 3 columns + Foreign ID, or just make it 13 columns and use no joins?
Since these columns will be used frequently, I guess adding 3 more columns is better, but if I need to create 10 more columns which will be used 90% of the time, should I add them as well, or create a new table and use joins?
I am not sure when to break the table if the columns are used very frequently. Do you have any suggestions?
Since it's such a high share of use cases (90%) and the fields are only numbers (not text), I would certainly be inclined to just add the fields to the existing table.
Edit: only break tables apart if the information is large and/or infrequently accessed. There's no fixed rule; you might just have to run tests if you're unsure of the benefits.
Space is not a big deal these days - I'd say that the decision to add columns to a table should be based on "are the columns directly related to the table", not "how often will the columns be used".
So basically, yes, add them to the table. For further considerations on mainstream database design, see 3NF.
The frequency of usage should be of no concern for your table layout, at least not until you start with huge tables (in number of rows or columns)
The question to answer is: is it normalized with the additional columns? Google it, there are plenty of resources about it (of varying quality, though).
Ditto some earlier posters. 95% of the time, you should design your tables based on logical entities. If you have 13 data elements that all describe the same "thing", then they all belong in one table. Don't break them into multiple tables based on how often you expect them to be used or to be used together. This usually creates more problems than it solves.
If you end up with a table that has some huge number of very large fields, and only a few of them are normally used, and it's causing a performance problem, then you might consider breaking it up. But you should only do that when you see that it really is causing a performance problem. Pre-emptive strikes in this area are almost always a mistake.
In my experience, the only time breaking up a table for performance reasons has shown any value is when there are some rarely-used, very large text fields. Like when there's a "Miscellaneous Extra Comments" field or "Text of the novel this customer is writing".
My advice is the same as cedo's: go with the 13 columns.
Adding another table to the DB, with another index, might just eat up the space you saved, and it will result in slower and more complicated queries.
Try looking into Database Normalization for some clearly outlined guidelines for planning database structures.

Should one steer clear of adding yet another field to a larger MySQL table?

I have a MySQL-InnoDB table with 350,000+ rows, containing a couple of things like id, otherId, shortTitle and so on. Now I'm in need of a Bool/ Bit field for perhaps a couple of hundreds or thousands of those rows. Should I just add that bool field into the table, or should I best create a new table referencing the IDs of the old table -- thereby not risking to cause performance issues on all the old existing functions that access the first table?
(Side info: I'm never using "SELECT * ...". The main table has lots of reading, rarely writing.)
Adding a field can indeed hamper performance a little, since your table rows grow larger, but it's hardly a problem for a BIT field.
Most probably, you will have exactly the same row count per page, which means no performance decrease at all.
On the other hand, using an extra JOIN to access the row value in another table will be much slower.
I'd add the column right into the table.
What does the new column denote?
From the data modelling perspective, if the column belongs with the data under whichever normal form is in use, then put it with the data; performance impact be damned. If the column doesn't directly belong to the table, then put it in a second table with a foreign key.
Realistically, the performance impact of adding a new column to a table with ~350,000 rows isn't going to be particularly huge. Have you tried issuing the ALTER TABLE statement against a copy, perhaps on a local workstation?
I don't know why people insist on calling 350K-row tables big. In the mainframe world, that's how big the DBMS configuration tables are :-).
That said, you should be designing your tables in third normal form. If, and only if, you have performance problems, then should you consider de-normalizing.
If you have a column that will apply only to certain of the rows, it's (probably) not going to be 3NF to put it in the same table. You should have a separate table with a foreign key into your 'primary' table.
Keep in mind that's if the boolean field actually doesn't apply to some of the rows. That's a different situation to the field applying to all rows but not being known for some. In that case, a nullable column in the primary table would be better. But that doesn't sound like what you're describing.
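The two layouts being contrasted can be sketched side by side, using Python's sqlite3 as a stand-in for MySQL. The names entry, is_special and entry_special are hypothetical; only shortTitle comes from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Option 1: nullable column on the main table (the flag applies to every row,
# but its value may be unknown for some of them).
conn.execute("""
    CREATE TABLE entry (
        id         INTEGER PRIMARY KEY,
        shortTitle TEXT,
        is_special INTEGER  -- 1, 0, or NULL when not known
    )
""")
# Option 2: side table holding a row only for the entries the flag applies to.
conn.execute("""
    CREATE TABLE entry_special (
        entry_id INTEGER PRIMARY KEY REFERENCES entry(id)
    )
""")
conn.executemany(
    "INSERT INTO entry VALUES (?, ?, ?)",
    [(1, "x", None), (2, "y", 1), (3, "z", 0)],
)
conn.execute("INSERT INTO entry_special VALUES (2)")

via_column = conn.execute("SELECT id FROM entry WHERE is_special = 1").fetchall()
via_join = conn.execute("""
    SELECT e.id
    FROM entry AS e
    JOIN entry_special AS s ON s.entry_id = e.id
""").fetchall()
print(via_column, via_join)
```

Both queries return the same rows here; the difference is whether the flag is modelled as a property of every entry or as membership in a smaller set, and whether reads need a join.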
Requiring a bit field for the next entries only sounds like you want to implement inheritance. If that is the case, I would add it to a new table to keep things readable. Otherwise, it doesn't matter if you add it to the main table or not, unless your queries are not using indexes, in which case I would change that first before making any other decisions regarding performance.