Optional fields on very large db - mysql

I am currently working on the redesign of a MySQL database.
This database is actually one huge table of 200 fields and several million rows, with almost no indexes... In short: disastrous performance and huge RAM consumption!
I first managed to reduce the number of fields by setting up 1:n relationships.
I have a question on a specific point:
A number of fields in this database are optional and rarely filled in (sometimes almost never).
What is the best thing to do in this case?
leave the field in the table even if its value is very often NULL
set up an n:n relationship, knowing that such a relationship, if it exists, will only ever return one row
...or another solution I haven't thought of
Thank you in advance for your wise advice;)
Dimitri

My suggestion:
First of all, make sure to normalize your db. At least to 3rd Normal Form. This will probably reduce some of your original columns and split them over several tables.
Once that is done, in case you still have lots of 'optional and rarely filled' columns in some of your tables, consider how to proceed depending on your needs: What is most important to you? Disk space or Performance?
Take a look at MySQL Optimizing Data Size for extra tips/improvements for the re-design of your database ...depending on your needs.
Regarding 'set up an n:n relationship...' (I figure you meant a 1:1 relationship): it can be an interesting option in some cases:
In some circumstances, it can be beneficial to split into two a table that is scanned very often. This is especially true if it is a dynamic-format table and it is possible to use a smaller static format table that can be used to find the relevant rows when scanning the table (extracted from here)
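A minimal sketch of that 1:1 split, with hypothetical table and column names (nothing here comes from the original post): the rarely filled optional fields move to a side table that shares the primary key, so a NULL is simply an absent row instead of being stored millions of times.

```sql
-- Hot core table: the frequently used, mostly NOT NULL columns.
CREATE TABLE product (
    product_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name       VARCHAR(100)  NOT NULL,
    price      DECIMAL(10,2) NOT NULL
    -- ...other frequently used columns...
);

-- 1:1 side table: same primary key, holds the optional extras.
CREATE TABLE product_extra (
    product_id       INT UNSIGNED NOT NULL PRIMARY KEY,
    long_description TEXT,
    import_notes     TEXT,
    FOREIGN KEY (product_id) REFERENCES product (product_id)
);

-- A LEFT JOIN brings the optional data back when it exists.
SELECT p.name, e.long_description
FROM product p
LEFT JOIN product_extra e ON e.product_id = p.product_id;
```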

Related

Use many tables with few columns or few tables with many columns in MySQL?

I'm setting up a table that might have upwards of 70 columns. I'm now thinking about splitting it up as some of the data in the columns won't be needed every time the table is accessed. Then again, if I do this I'm left with having to use joins.
At what point, if any, is it considered too many columns?
It's considered too many once it's above the maximum limit supported by the database.
The fact that you don't need every column to be returned by every query is perfectly normal; that's why the SELECT statement lets you explicitly name the columns you need.
As a general rule, your table structure should reflect your domain model; if you really do have 70 (100, what have you) attributes that belong to the same entity there's no reason to separate them into multiple tables.
There are some benefits to splitting up the table into several with fewer columns, which is also called Vertical Partitioning. Here are a few:
If you have tables with many rows, modifying the indexes can take a very long time, as MySQL needs to rebuild all of the indexes in the table. Having the indexes split over several tables could make that faster.
Depending on your queries and column types, MySQL could be writing temporary tables (used in more complex SELECT queries) to disk. This is bad, as disk I/O can be a big bottleneck. This occurs if you have binary data (TEXT or BLOB) in the query.
Wider tables can lead to slower query performance.
Don't prematurely optimize, but in some cases, you can get improvements from narrower tables.
It is too many when it violates the rules of normalization. It is pretty hard to get that many columns if you are normalizing your database. Design your database to model the problem, not around any artificial rules or ideas about optimizing for a specific db platform.
Apply the following rules to the wide table and you will likely have far fewer columns in a single table.
No repeating elements or groups of elements
No partial dependencies on a concatenated key
No dependencies on non-key attributes
Here is a link to help you along.
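As an illustration of the first rule (no repeating elements), a hypothetical table with a repeating group baked into its columns can be split into a child table, one row per element. All names here are invented for the example:

```sql
-- Before (anti-pattern): repeating group as numbered columns.
-- CREATE TABLE orders (order_id INT PRIMARY KEY,
--     item1 VARCHAR(50), item2 VARCHAR(50), item3 VARCHAR(50));

-- After: one row per element in a child table.
CREATE TABLE orders (
    order_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    ordered_at DATETIME NOT NULL
);

CREATE TABLE order_item (
    order_id INT UNSIGNED      NOT NULL,
    line_no  SMALLINT UNSIGNED NOT NULL,
    item     VARCHAR(50)       NOT NULL,
    PRIMARY KEY (order_id, line_no),
    FOREIGN KEY (order_id) REFERENCES orders (order_id)
);
```

An order can now have any number of items without adding columns, and "no item" costs no storage at all.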
That's not a problem as long as all the attributes belong to the same entity and do not depend on each other.
To make life easier, you can have one TEXT column with a JSON array stored in it; obviously, only if you don't have a problem with getting all the attributes every time. Although this would entirely defeat the purpose of storing it in an RDBMS and would greatly complicate every database transaction, so it's not a recommended approach to follow throughout the database.
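The answer mentions a plain TEXT column; a sketch of the idea using MySQL 5.7+'s native JSON type (hypothetical table and attribute names) looks like this. Searchable columns stay as real columns; everything else goes into the catch-all:

```sql
CREATE TABLE user_profile (
    user_id INT UNSIGNED NOT NULL PRIMARY KEY,
    email   VARCHAR(255) NOT NULL,   -- searchable: keep as a real column
    extras  JSON                     -- rarely used attributes live here
);

INSERT INTO user_profile (user_id, email, extras)
VALUES (1, 'a@example.com', '{"newsletter": true, "shoe_size": 43}');

-- MySQL 5.7+ can still reach inside when needed (at the cost of a scan):
SELECT user_id, extras->>'$.shoe_size' AS shoe_size
FROM user_profile;
```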
Having too many columns in the same table can cause huge problems in replication as well. You should know that the changes that happen on the master will replicate to the slave; for example, if you update one field in the table, the whole row may be written to the binary log and replicated.

Strategies for large databases with changing schemas

We have a MySQL database table with hundreds of millions of rows. We run into issues with performing any kind of operation on it. For example, adding columns is becoming impossible to do with any kind of predictable time frame. When we want to roll out a new column, the ALTER TABLE command takes forever, so we don't have a good idea as to what the maintenance window is.
We're not tied to keeping this data in mysql, but I was wondering if there are strategies for mysql or databases in general, for updating schemas for large tables.
One idea, which I don't particularly like, would be to create a new table with the old schema plus the additional column, and run queries against a view which unions the results until all data can be moved to the new table schema.
Right now we already run into issues where deleting large numbers of records based on a WHERE clause exits with an error.
Ideas?
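The view-based migration idea from the question could be sketched roughly like this (all table and column names are hypothetical; this is one way it might look, not a tested recipe):

```sql
-- 1. New table with the extra column, initially empty.
CREATE TABLE events_v2 LIKE events;
ALTER TABLE events_v2 ADD COLUMN source VARCHAR(50) NULL;

-- 2. A view that unions old and new while rows are copied over in batches.
CREATE VIEW events_all AS
    SELECT e.*, NULL AS source FROM events e
    UNION ALL
    SELECT * FROM events_v2;

-- 3. Copy rows into events_v2 and delete them from events in small chunks;
--    when events is empty, drop the view and rename events_v2 to events.
```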
In MySQL, you can create a new table using an entity-attribute-value model. This would have one row per entity and attribute, rather than putting the attribute in a new column.
This is particularly useful for sparse data. Words of caution: types are problematic (everything tends to get turned into strings) and you cannot define foreign key relationships.
EAV models are particularly useful for sparse values -- when you have attributes that apply to only a small number of rows. They may prove useful in your case.
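A minimal EAV sketch (hypothetical names), showing both caveats mentioned above: the value column degrades everything to strings, and there is nowhere to hang a per-attribute foreign key.

```sql
-- One row per (entity, attribute) instead of one column per attribute.
CREATE TABLE entity_attribute (
    entity_id      INT UNSIGNED NOT NULL,
    attribute_name VARCHAR(64)  NOT NULL,
    value          VARCHAR(255),         -- everything becomes a string
    PRIMARY KEY (entity_id, attribute_name)
);

-- Absent attributes simply have no row, so sparse data costs nothing:
INSERT INTO entity_attribute VALUES (42, 'fax_number', '+33 1 23 45 67 89');

-- Reading one attribute back:
SELECT value
FROM entity_attribute
WHERE entity_id = 42 AND attribute_name = 'fax_number';
```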
In NOSQL data models, adding new attributes or lists of attributes is simpler. However, there is no relationship to the attributes in other rows.
Columnar databases (at least the one in MariaDB) are very frugal with space -- some say 10x smaller than InnoDB. The shrinkage, alone, may be well worth it for 100M rows.
You have not explained whether your data is sparse. If it is, the JSON is not that costly for space -- completely leave out any 'fields' that are missing; zero space. With almost any other approach, there is at least some overhead for missing cells.
As you suggest, use regular columns for common fields. But also for the main fields that you are likely to search on. Then throw the rest into JSON.
I like to compress (in the client) the JSON string and use a BLOB. This gives 3x shrinkage over storing uncompressed TEXT.
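The answer does the compression in the client; MySQL's built-in COMPRESS()/UNCOMPRESS() functions illustrate the same idea server-side (hypothetical table, and note the column is a BLOB, not TEXT):

```sql
CREATE TABLE doc (
    doc_id INT UNSIGNED NOT NULL PRIMARY KEY,
    body   BLOB                  -- holds the compressed JSON string
);

INSERT INTO doc
VALUES (1, COMPRESS('{"a": 1, "notes": "...a long JSON string..."}'));

-- Decompress on the way out:
SELECT UNCOMPRESS(body) FROM doc WHERE doc_id = 1;
```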
I dislike the one-row per attribute EAV approach; it is very costly in space, JOINs, etc, etc.
[More thoughts] on EAV.
Do avoid ALTER whenever possible.

How many columns in MySQL table

How many fields are OK to have in a MySQL table?
I have a form to get information from users; it is divided into 6 or 7 units, but all related to each other. There are about 100 fields in total.
Do you recommend to keep all of them in one table?
Thank you all for your responses. My table's data is some personal information about users; all the work that needs to be done is to get it from the user, save it, show it, and offer an edit option.
Should I use normalization? Doesn't it increase processing time in my situation?
Providing you are following database normalization, you generally should be ok - although you may find some performance issues down the road.
To me, it seems like perhaps there could be some normalization?
Also, you should consider how many of these columns will just have null values, and what naming conventions you are using (not just name, name2 etc)
In case you want to read into it more.:
Data normalization basics
MySql :: An introduction to database normalization
MySQL supports up to 4096 columns per table. Don't use a field in the main table if it often takes empty values.
A NULL value is no problem, but a value stored as '' (an empty string) is wasted memory.
If a field is NULL most of the time, its data is optional, so why store it in the main table?
That's what I meant in the statement.
For data that does not need to be indexed or searched, I create a single column in my database like col_meta which holds a serialized PHP array of values; it saves space and can expand or contract as needed. Just a thought.
There's no general rule to that. However, 100 columns definitely hints that your DB design is plain wrong. You should read about normalization and database design before continuing. I have a DB design with ~150 columns split up into nearly 40 tables, just to give you an idea.
If they all relate and cannot be normalised (and nothing multi-valued is serialised into a single column), then I guess it will be OK.
Are you sure some things can't be split into multiple tables though? Can you provide a sample of the information you are gathering?
Could you split things into such as
user_details
user_car_details
...
I would typically start splitting them into separate tables after around 20-30 columns. However, it is really driven by how you are going to use the data.
If you always need all the columns, then splitting them out into separate tables will add unnecessary complication and overhead. However, if there are groups of columns that tend to be accessed together, then splitting along those lines with a 1-1 relationship between the tables might be useful.
This technique can be useful if performance is a concern, especially if you have large text columns in your table.
Lastly, I would echo the other posters' comments about normalisation. It is relatively rare to have 100s of columns in one table, and it may be that you need to normalise the table.
For example columns, SETTING_1, SETTING_2, SETTING_3, SETTING_4 might be better in a separate SETTINGS table, which has a 1-many relationship to the original table.
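A sketch of that suggestion with invented names (the surrounding `users` table is an assumption): the numbered SETTING_n columns become rows in a child table with a one-to-many relationship.

```sql
CREATE TABLE user_setting (
    user_id       INT UNSIGNED NOT NULL,
    setting_name  VARCHAR(64)  NOT NULL,
    setting_value VARCHAR(255) NOT NULL,
    PRIMARY KEY (user_id, setting_name),
    FOREIGN KEY (user_id) REFERENCES users (user_id)
);

-- Any number of settings per user, and unset settings occupy no space.
INSERT INTO user_setting VALUES (7, 'theme', 'dark');
```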
Thank you all. So I think it's better for me to merge some of my data into one column, which saves me more space and decreases overhead.
I've noticed a performance loss at around 65 columns in a DB with around 10,000 rows. But it could be a lack of server resources or something else.

Is there a performance decrease if there are too many columns in a table?

Is there a performance cost to having large numbers of columns in a table, aside from the increase in the total amount of data? If so, would splitting the table into a few smaller ones help the situation?
I don't agree with all these posts saying 30 columns smells like bad code. If you've never worked on a system that had an entity that had 30+ legitimate attributes, then you probably don't have much experience.
The answer provided by HLGEM is actually the best one of the bunch. In particular, the question "is there a natural split... frequently used vs. not frequently used" is a very good one to ask yourself, and you may be able to break up the table in a natural way (if things get out of hand).
My comment would be, if your performance is currently acceptable, don't look to reinvent a solution unless you need it.
I'm going to weigh in on this even though you've already selected an answer. Yes, tables that are too wide can cause performance problems (and data problems as well) and should be separated out into tables with one-to-one relationships. This is due to how the database stores the data (well, at least in SQL Server; I'm not sure about MySQL, but it is worth reading the documentation about how the database stores and accesses data).
Thirty columns might be too wide and might not, it depends on how wide the columns are. If you add up the total number of bytes that your 30 columns will take up, is it wider than the maximum number of bytes that can be stored in a record?
Are some of the columns ones you will need less often than others (in other words is there a natural split between required and frequently used info and other stuff that may appear in only one place not everywhere else), then consider splitting up the table.
If some of your columns are things like phone1, phone2, phone3 - then it doesn't matter how many columns you have you need a related table with a one-to-many relationship instead.
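The phone1/phone2/phone3 case above could be restructured like this (hypothetical names; the referenced `contact` table is an assumption):

```sql
-- One-to-many: any number of phone numbers per contact, no schema changes.
CREATE TABLE contact_phone (
    contact_id INT UNSIGNED NOT NULL,
    phone_type ENUM('home','work','mobile','other') NOT NULL DEFAULT 'other',
    phone      VARCHAR(20)  NOT NULL,
    PRIMARY KEY (contact_id, phone),
    FOREIGN KEY (contact_id) REFERENCES contact (contact_id)
);

-- Fetch all numbers for one contact:
SELECT phone_type, phone FROM contact_phone WHERE contact_id = 42;
```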
In general, though, 30 columns is not unusually wide and will probably be OK.
If you really need all those columns (that is, it's not just a sign that you have a poorly designed table) then by all means keep them.
It's not a performance problem, as long as you
use appropriate indexes on columns you need to use to select rows
don't retrieve columns you don't need in SELECT operations
If you have 30, or even 200 columns it's no problem to the database. You're just making it work a little harder if you want to retrieve all those columns at once.
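The two points above, as a short sketch with invented table and column names:

```sql
-- 1. Index the columns you use to select rows:
CREATE INDEX idx_orders_customer ON orders (customer_id, created_at);

-- 2. Name only the columns you need instead of SELECT *:
SELECT order_id, total
FROM orders
WHERE customer_id = 123
  AND created_at >= '2023-01-01';
```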
But having a lot of columns is a bad code smell; I can't think of any legitimate reason a well-designed table would have this many columns and you may instead be needing a one-many relationship with some other, much simpler, table.
Technically speaking, 30 columns is absolutely fine. However, tables with many columns are often a sign that your database isn't properly normalized, that is, it can contain redundant and / or inconsistent data.
30 doesn't seem too many to me. In addition to necessary indexes and proper SELECT queries, two basic tips apply well to wide tables:
Define your columns as small as possible.
Avoid dynamic-width column types such as VARCHAR or TEXT as much as possible when you have a large number of columns per table; try fixed-length types such as CHAR instead. This trades disk storage for performance.
For instance, for the columns 'name', 'gender', 'age', 'bio' in a 'person' table with 100 or more columns, to maximize performance they are best defined as:
name - CHAR(70)
gender - TINYINT(1)
age - TINYINT(2)
bio - TEXT
The idea is to define columns as small as possible and with fixed length where reasonably possible. Dynamic-width columns should go at the end of the table structure, so that all fixed-length columns come before them.
It goes without saying that this introduces considerable wasted disk storage with a large number of rows, but since you want performance, I guess that is the cost.
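The column list above written out as an actual table definition (a sketch; this fixed-before-dynamic ordering matters mostly for older MyISAM-style storage):

```sql
CREATE TABLE person (
    person_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name      CHAR(70)    NOT NULL,   -- fixed length, not VARCHAR
    gender    TINYINT(1)  NOT NULL,
    age       TINYINT(2)  NOT NULL,
    -- ...more fixed-length columns here...
    bio       TEXT                    -- dynamic-width columns go last
);
```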
Another tip: as you go along, you will find that some columns are much more frequently used (selected or updated) than others. You should separate them into another table, forming a one-to-one relationship with the table that contains the infrequently used columns, and perform your queries with fewer columns involved.
Should be fine, unless you have select * from yourHugeTable all over the place. Always only select the columns you need.
30 columns would not normally be considered an excessive number.
Three thousand columns, on the other hand...
How would you implement a very wide "table"?
Beyond performance, database normalization is a necessity for databases with many tables and relations. Normalization gives you easy access to your models and flexible relations for executing different SQL queries.
As shown here, there are eight normal forms. But for many systems, applying the first, second and third normal forms is enough.
So, instead of selecting related columns and writing long SQL queries, well-normalized database tables would be better.
Usage wise, it's appropriate in some situations, for example where tables serve more than one application that share some columns but not others, and where reporting requires a real-time single data pool for all, no data transitions. If a 200 column table enables that analytic power and flexibility, then I'd say "go long." Of course in most situations normalization offers efficiency and is best practice, but do what works for your need.

Is it better to break the table up and create new table which will have 3 columns + Foreign ID, or just make it n+3 columns and use no joins?

I am designing a database for a project. I have a table that has 10 columns, most of them are used whenever the table is accessed, and I need to add 3 more columns;
View Count
Thumbs Up (count)
Thumbs Down (Count)
which will be used in 90% of the queries when the table is accessed. So, my question is whether it is better to break the table up and create a new table which will have these 3 columns + a foreign ID, or just make it 13 columns and use no joins?
Since these columns will be used frequently, I guess adding 3 more columns is better; but if I need to add 10 more columns which will be used 90% of the time, should I add them as well, or create a new table and use joins?
I am not sure when to break the table if the columns are used very frequently. Do you have any suggestions?
Since it's such a high usage rate (90%) and the fields are only numbers (not text), I would certainly be inclined to just add the fields to the existing table.
Edit: only break tables apart if the information is large and/or infrequently accessed. There's no fixed rule; you might just have to run tests if you're unsure as to the benefits.
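Adding the three counters in place could look like this (the table name `article` is invented; the column names come from the question):

```sql
ALTER TABLE article
    ADD COLUMN view_count  INT UNSIGNED NOT NULL DEFAULT 0,
    ADD COLUMN thumbs_up   INT UNSIGNED NOT NULL DEFAULT 0,
    ADD COLUMN thumbs_down INT UNSIGNED NOT NULL DEFAULT 0;

-- Updating a counter is then a single, join-free statement:
UPDATE article SET view_count = view_count + 1 WHERE article_id = 42;
```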
Space is not a big deal these days - I'd say that the decision to add columns to a table should be based on "are the columns directly related to the table", not "how often will the columns be used".
So basically, yes, add them to the table. For further considerations on mainstream database design, see 3NF.
The frequency of usage should be of no concern for your table layout, at least not until you start with huge tables (in number of rows or columns)
The question to answer is: Is it normalized with the additional columns. Google it, there a plenty of resources about it (with varying quality though)
Ditto some earlier posters. 95% of the time, you should design your tables based on logical entities. If you have 13 data elements that all describe the same "thing", than they all belong in one table. Don't break them into multiple tables based on how often you expect them to be used or to be used together. This usually creates more problems than it solves.
If you end up with a table that has some huge number of very large fields, and only a few of them are normally used, and it's causing a performance problem, then you might consider breaking it up. But you should only do that when you see that it really is causing a performance problem. Pre-emptive strikes in this area are almost always a mistake.
In my experience, the only time breaking up a table for performance reasons has shown any value is when there are some rarely-used, very large text fields. Like when there's a "Miscellaneous Extra Comments" field or "Text of the novel this customer is writing".
My advice is the same as cedo's: go with the 13 columns.
Adding another table to the DB, with another index, might just eat up the space you saved and will result in slower, more complicated queries.
Try looking into Database Normalization for some clearly outlined guidelines for planning database structures.