Is it better to have many columns, or many tables? - mysql

Imagine a hypothetical database that stores products. Each product can have 100 attributes, although any given product will only have values set for ~50 of these. I can see three ways to store this data:
A single table with 100 columns,
A single table with very few columns (say the 10 columns that have a value for every product), and another table with the columns (product_id, attribute, value). I.e., an EAV data store.
A separate table for every column. So the core products table might have 2 columns, and there would be 98 other tables, each with the two columns (product_id, value). (All three options are sketched below.)
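In DDL terms, the three layouts might look roughly like this (a minimal sketch; all table and column names are illustrative):

```sql
-- Option 1: one wide table
CREATE TABLE products_wide (
    product_id INT PRIMARY KEY,
    attr_1     VARCHAR(255),
    -- ... attr_2 through attr_99 elided ...
    attr_100   VARCHAR(255)
) ENGINE=InnoDB;

-- Option 2: a lean core table plus one EAV table
CREATE TABLE products (
    product_id INT PRIMARY KEY,
    name       VARCHAR(255)
    -- ... the ~10 always-present columns ...
) ENGINE=InnoDB;

CREATE TABLE product_attributes (
    product_id INT,
    attribute  VARCHAR(64),
    value      VARCHAR(255),
    KEY (product_id),
    FOREIGN KEY (product_id) REFERENCES products (product_id)
) ENGINE=InnoDB;

-- Option 3: one two-column table per attribute (x98)
CREATE TABLE product_attr_1 (
    product_id INT,
    value      VARCHAR(255),
    KEY (product_id),
    FOREIGN KEY (product_id) REFERENCES products (product_id)
) ENGINE=InnoDB;
```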
Setting aside the shades of grey between these extremes, from a pure efficiency standpoint, which is best to use? I assume it depends on the types of queries being run, i.e. whether most queries ask for several attributes of one product, or for the value of a single attribute across several products. How does this affect the efficiency?
Assume this is a MySQL database using InnoDB, and all tables have appropriate foreign keys, and an index on the product_id. Imagine that the attribute names and values are strings, and are not indexed.
In a general sense, I am asking whether accessing a really big table takes more or less time than a query with many joins.
I found a similar question here: Best to have hundreds of columns or split into multiple tables?
The difference is, that question is asking about a specific case, and doesn't really tell me about efficiency in the general case. Other similar questions are all talking about the best way to organize the data, I just want to know how the different organizational systems impact the speed of queries.

In a general sense, I am asking whether accessing a really big table takes more or less time than a query with many joins.
JOIN will be slower.
However, if you usually query only a specific subset of columns, and this subset is "vertically partitioned" into its own separate table, querying such a "lean" table is typically quicker than querying the "fat" table with all the columns.
But this is a very specific and fragile situation (easy to break as the system evolves), and you should test very carefully before going down that path. Your default starting position should be one table.

In general, the more tables you have, the more normalised and correct your design is, and hence the better (i.e. the less redundancy in the data).
If you later find you have problems with reporting on this data, then that may be the moment to consider creating denormalised values to improve any specific performance issues. Adding denormalised values later is much less painful than normalising an existing badly designed database.
In most cases, EAV is a querying and maintenance nightmare.
An outline design would be to have a table for Products, a table for Attributes, and a ProductAttributes table that contained the ProductID and the AttributeID of the relevant entries.
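A minimal sketch of that outline, assuming string-valued attributes (names illustrative):

```sql
CREATE TABLE Products (
    ProductID INT PRIMARY KEY,
    Name      VARCHAR(255)
);

CREATE TABLE Attributes (
    AttributeID INT PRIMARY KEY,
    Name        VARCHAR(64)
);

CREATE TABLE ProductAttributes (
    ProductID   INT,
    AttributeID INT,
    Value       VARCHAR(255),
    PRIMARY KEY (ProductID, AttributeID),
    FOREIGN KEY (ProductID)   REFERENCES Products (ProductID),
    FOREIGN KEY (AttributeID) REFERENCES Attributes (AttributeID)
);
```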

As you mentioned, it strictly depends on the queries that will be executed against this data. As you know, joins are costly for the database. I can't imagine doing 50-60 joins for a simple read; in my humble opinion it would be madness. :) The best thing you can do is load test data and check out your specific queries in a tool such as the Estimated Execution Plan in Management Studio. MySQL has a similar facility (EXPLAIN).
I would tend to advise you to avoid creating so many tables. I think it is bound to cause problems in the future. Maybe it is possible to move rarely used data into separate tables, or to use complex types? For string data you can try nonclustered indexes.
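For example, a quick way to inspect a join's plan in MySQL (a sketch; table names are illustrative):

```sql
-- EXPLAIN shows which indexes each joined table will use
EXPLAIN
SELECT p.product_id, a.attribute, a.value
FROM products AS p
JOIN product_attributes AS a ON a.product_id = p.product_id
WHERE p.product_id = 42;
```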

Related

Is the performance of join two one-to-one tables on a single-node database the same as a pure select on an equivalent denormalized table?

There are two big (millions of records) one-to-one tables:
course
prerequisite with a foreign key reference to the course table
in a single-node relational MySQL database. A join is needed to list the full description of all the courses.
An alternative is to have only one single table to contain both the course and prerequisite data in the same database.
Question: is the performance of the join query still slower than that of a simple select query without a join on the single denormalized table, given that they are in the same single-node MySQL database?
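Concretely, the two variants being compared might look like this (a sketch; column names are assumed):

```sql
-- Normalized: one join to list each course with its prerequisite
SELECT c.course_id, c.name, p.prerequisite_course_id
FROM course AS c
JOIN prerequisite AS p ON p.course_id = c.course_id;

-- Denormalized: a single table already holding both
SELECT course_id, name, prerequisite_course_id
FROM course_denormalized;
```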
It's true that denormalization is often done to shorten the work to look up one record with its associated details. This usually means the query responds in less time.
But denormalization improves one query at the expense of other queries against the same data. Making one query faster will often make other queries slower. For example, what if you want to query the set of courses that have a given prerequisite?
It's also a risk when you use denormalization that you create data anomalies. For example, if you change a course name, you would also need to update all the places where it is named as a prerequisite. If you forget one, then you'll have a weird scenario where the obsolete name for a course is still used in some places.
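For instance, if course names are copied into the denormalized rows, a rename is no longer a single-row change (a sketch with assumed names):

```sql
-- Normalized: exactly one row to change
UPDATE course SET name = 'Databases II' WHERE course_id = 42;

-- Denormalized: every row that embeds the old name must also be found and fixed
UPDATE course_denormalized
SET prerequisite_name = 'Databases II'
WHERE prerequisite_name = 'Databases I';
```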
How will you know you found them all? How much work in the form of extra queries will you have to do to double-check that you have no anomalies? Do those types of extra queries count toward making your database slower on average?
The purpose of normalizing a database is not performance. It's avoiding data anomalies, which reduces your work in other ways.

Use many tables with few columns or few tables with many columns in MySQL? [duplicate]

I'm setting up a table that might have upwards of 70 columns. I'm now thinking about splitting it up as some of the data in the columns won't be needed every time the table is accessed. Then again, if I do this I'm left with having to use joins.
At what point, if any, is it considered too many columns?
It's considered too many once it's above the maximum limit supported by the database.
The fact that you don't need every column to be returned by every query is perfectly normal; that's why the SELECT statement lets you explicitly name the columns you need.
As a general rule, your table structure should reflect your domain model; if you really do have 70 (100, what have you) attributes that belong to the same entity there's no reason to separate them into multiple tables.
There are some benefits to splitting up the table into several with fewer columns, which is also called Vertical Partitioning. Here are a few:
If you have tables with many rows, modifying the indexes can take a very long time, as MySQL needs to rebuild all of the indexes in the table. Having the indexes split over several tables could make that faster.
Depending on your queries and column types, MySQL could be writing temporary tables (used in more complex select queries) to disk. This is bad, as disk I/O can be a big bottleneck. It occurs if the query involves binary data (text or blob columns).
Wider tables can lead to slower query performance.
Don't prematurely optimize, but in some cases, you can get improvements from narrower tables.
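A minimal sketch of such a vertical partition (names illustrative):

```sql
-- Frequently queried columns stay in the main table
CREATE TABLE products (
    product_id INT PRIMARY KEY,
    name       VARCHAR(255),
    price      DECIMAL(10,2)
);

-- Rarely used, bulky columns move to a 1:1 side table
CREATE TABLE product_details (
    product_id  INT PRIMARY KEY,
    description TEXT,
    FOREIGN KEY (product_id) REFERENCES products (product_id)
);
```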
It is too many when it violates the rules of normalization. It is pretty hard to get that many columns if you are normalizing your database. Design your database to model the problem, not around any artificial rules or ideas about optimizing for a specific db platform.
Apply the following rules to the wide table and you will likely have far fewer columns in a single table.
No repeating elements or groups of elements
No partial dependencies on a concatenated key
No dependencies on non-key attributes
Here is a link to help you along.
That's not a problem as long as all attributes belong to the same entity and do not depend on each other.
To make life easier you can have one text column with a JSON array stored in it; obviously, only if you don't mind getting all the attributes every time. This largely defeats the purpose of storing the data in an RDBMS, though, and greatly complicates every database transaction, so it's not a recommended approach to follow throughout the database.
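A sketch of that approach (MySQL 5.7+ has a native JSON type; on older versions a TEXT column holds the same string; names illustrative):

```sql
CREATE TABLE products (
    product_id INT PRIMARY KEY,
    attrs      JSON  -- use TEXT on MySQL < 5.7
);

INSERT INTO products VALUES (1, '{"color": "red", "weight_kg": 12}');

-- Every attribute read goes through JSON functions instead of plain columns
SELECT product_id, JSON_EXTRACT(attrs, '$.color') AS color
FROM products;
```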
Having too many columns in the same table can cause huge problems with replication as well. You should know that changes that happen on the master will replicate to the slave. For example, with row-based replication, if you update one field in the table, the whole row will be written to the binary log and replicated.

Is it better to divide data into separate tables or keep it in a single table for best performance?

I have seen a few questions related to this, but I felt they weren't exactly the same situation. This is also not a normalization-related question.
Let's assume that we have a product which has some properties such as name, description, price, last_update_date, stock_amount.
Let's assume there will never be two different prices or stock amounts etc. for these products, and we don't have to keep historic data.
From a performance point of view, would it be better to keep all of this data in a single table, or divide it into separate tables, such as:
products -> id, name, last_update_date, stock_amount, price
product_info -> id, products_id, description
I know the data is not divided very logically, but that is beside the point right now.
I can think of two arguments, perhaps:
If you separate the data into two tables, then to update a description, for example, one would need to find the products_id first and then update, which may cost more. On the other hand, the products table's storage footprint would be much smaller. Does this help in efficiency when finding a product, for example by name? Or, since we would have an index for name, does it not matter how big the table is on disk?
Well, if everything were in one table, we wouldn't need to work on separate tables, and this may increase efficiency?
What do you think? and what do you base your opinion on? Links and benchmark results are welcome.
Thanks!
If everything is a 1-to-1 mapping, there's no strong reason not to keep it all in one table. You should still have an ID column, so that if you have other data that's 1-to-many or many-to-many, you can refer to the products by ID in those tables.
However, one benefit of splitting it up into different tables can be improved concurrency. If everything is in one table, then an update to that table will lock the entire row (or the entire table if you use MyISAM). If you split it into multiple tables, then an update to one of those tables won't interfere with queries that use the other tables.
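A sketch of the effect with InnoDB row locks, using the tables from the question:

```sql
-- Single wide table: this write row-locks the whole product row, so a
-- concurrent write to any other column of product 1 must wait
UPDATE products SET description = 'new text' WHERE id = 1;

-- Split tables: the description write locks only the product_info row,
-- so a concurrent stock update on products proceeds independently
UPDATE product_info SET description = 'new text' WHERE products_id = 1;
UPDATE products SET stock_amount = stock_amount - 1 WHERE id = 1;
```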
I think efficiency is better with a single table. Two tables may be useful for further scalability.

Which of these 2 MySQL DB Schema approaches would be most efficient for retrieval and sorting?

I'm confused as to which of the two db schema approaches I should adopt for the following situation.
I need to store multiple attributes for a website, e.g. page size, word count, category, etc., where the number of attributes may increase in the future. The purpose is to display this table to the user, and he should be able to quickly filter/sort the data (so the table structure should support fast querying and sorting). I also want to keep a log of previous data to maintain a timeline of changes. So the two table structure options I've thought of are:
Option A
website_attributes
id, website_id, page_size, word_count, category_id, title_id, ...... (going up to 18 columns; keep in mind that there might be a few null values and that more columns may need to be added in the future)
website_attributes_change_log
same table structure as above, with an added column for "change_update_time"
I feel the advantage of this schema is that queries will be easy to write even when some attributes are linked to other tables, and sorting will be simple. The disadvantage, I guess, is that adding columns later can be problematic, with ALTER TABLE taking very long to run on large tables, plus there could be many rows with many null columns.
Option B
website_attribute_fields
attribute_id, attribute_name (e.g. page_size), attribute_value_type (e.g. int)
website_attributes
id, website_id, attribute_id, attribute_value, last_update_time
The advantage here seems to be the flexibility of this approach, in that I can add columns whenever, and I also save on storage space. However, as much as I'd like to adopt this approach, I feel that writing queries will be especially complex when displaying the tables [since I will need to display records for multiple sites at a time, and there will also be cross-referencing of values with other tables for certain attributes] + sorting the data might be difficult [given that this is not a column-based approach].
A sample output of what I'd be looking at would be:
Site-A.com, 232032 bytes, 232 words, PR 4, Real Estate [linked to category table], ..
Site-B.com, ..., ..., ... ,...
And the user needs to be able to sort by all the number based columns, in which case approach B might be difficult.
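For instance, sorting by a numeric attribute under each option might look like this (a sketch; the attribute_id values are assumed):

```sql
-- Option A: a plain ORDER BY on a real column
SELECT website_id, page_size, word_count
FROM website_attributes
ORDER BY word_count DESC;

-- Option B (EAV): each attribute must be pivoted out of rows first, and
-- string-typed values need a cast before they sort numerically
SELECT website_id,
       MAX(CASE WHEN attribute_id = 1 THEN attribute_value END)                   AS page_size,
       MAX(CASE WHEN attribute_id = 2 THEN CAST(attribute_value AS UNSIGNED) END) AS word_count
FROM website_attributes
GROUP BY website_id
ORDER BY word_count DESC;
```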
So I want to know if I'd be doing the right thing by going with Option A or whether there are other better options that I might have not even considered in the first place.
I would recommend using Option A.
You can mitigate the pain of long-running ALTER TABLE by using pt-online-schema-change.
The upcoming MySQL 5.6 supports non-blocking ALTER TABLE operations.
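For instance, on MySQL 5.6+ an online add-column looks like this (a sketch; the column name is hypothetical):

```sql
-- ALGORITHM=INPLACE / LOCK=NONE let concurrent reads and writes continue
-- while the column is added; the statement errors out if that's not possible
ALTER TABLE website_attributes
    ADD COLUMN page_rank INT,
    ALGORITHM=INPLACE, LOCK=NONE;
```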
Option B is called Entity-Attribute-Value, or EAV. This breaks rules of relational database design, so it's bound to be awkward to write SQL queries against data in this format. You'll probably regret using it.
I have posted several times on Stack Overflow describing pitfalls of EAV.
Also in my blog: EAV FAIL.
Option A is the better way. Though ALTER TABLE may take a long time when adding an extra column, querying and sorting are quicker. I have used a design like Option A before, and ALTER TABLE didn't take too long even with millions of records in the table.
You should go with Option B because it is more flexible and uses less RAM. When you are using Option A, you have to fetch a lot of content into RAM, which increases the chance of page faults. If you want to improve the query time of the database, then you should definitely index your database to get fast results.
I think Option A is not a good design. When you design a good data model, you should not have to change the tables in the future. If you have a good command of SQL, queries against Option B will not be difficult. It is also the solution to your real problem: you need to store an open-ended (not final) set of attributes for some web pages, therefore an entity representing those attributes makes sense.
Use Option A, as the attributes are fixed. It will be difficult to query and process data from the second model, as queries will be based on multiple attributes.

Is there a performance decrease if there are too many columns in a table?

Is there a performance cost to having large numbers of columns in a table, aside from the increase in the total amount of data? If so, would splitting the table into a few smaller ones help the situation?
I don't agree with all these posts saying 30 columns smells like bad code. If you've never worked on a system that had an entity that had 30+ legitimate attributes, then you probably don't have much experience.
The answer provided by HLGEM is actually the best one of the bunch. I particularly like the question "is there a natural split....frequently used vs. not frequently used"; that is a very good question to ask yourself, and you may be able to break up the table in a natural way (if things get out of hand).
My comment would be, if your performance is currently acceptable, don't look to reinvent a solution unless you need it.
I'm going to weigh in on this even though you've already selected an answer. Yes, tables that are too wide can cause performance problems (and data problems as well) and should be separated out into tables with one-to-one relationships. This is due to how the database stores the data (well, at least in SQL Server; I'm not sure about MySQL, but it is worth reading the documentation about how the database stores and accesses data).
Thirty columns might be too wide, or might not; it depends on how wide the columns are. If you add up the total number of bytes that your 30 columns will take up, is it wider than the maximum number of bytes that can be stored in a record?
Are some of the columns ones you will need less often than others (in other words, is there a natural split between required, frequently used info and other stuff that may appear in only one place rather than everywhere)? Then consider splitting up the table.
If some of your columns are things like phone1, phone2, phone3, then it doesn't matter how many columns you have: you need a related table with a one-to-many relationship instead, as sketched below.
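A minimal sketch of that fix (table and column names illustrative):

```sql
CREATE TABLE persons (
    person_id INT PRIMARY KEY,
    name      VARCHAR(70)
);

-- Replaces phone1, phone2, phone3 columns on the persons table:
-- any number of phones, one row each
CREATE TABLE person_phones (
    person_id INT,
    phone     VARCHAR(20),
    PRIMARY KEY (person_id, phone),
    FOREIGN KEY (person_id) REFERENCES persons (person_id)
);
```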
In general, though, 30 columns is not unusually wide and will probably be OK.
If you really need all those columns (that is, it's not just a sign that you have a poorly designed table) then by all means keep them.
It's not a performance problem, as long as you
use appropriate indexes on columns you need to use to select rows
don't retrieve columns you don't need in SELECT operations
If you have 30, or even 200 columns it's no problem to the database. You're just making it work a little harder if you want to retrieve all those columns at once.
But having a lot of columns is a bad code smell; I can't think of any legitimate reason a well-designed table would have this many columns, and you may instead need a one-to-many relationship with some other, much simpler, table.
Technically speaking, 30 columns is absolutely fine. However, tables with many columns are often a sign that your database isn't properly normalized, that is, it can contain redundant and / or inconsistent data.
30 doesn't seem like too many to me. In addition to necessary indexes and proper SELECT queries, two basic tips apply well to wide tables:
Define your columns as small as possible.
Avoid using dynamic columns such as VARCHAR or TEXT as much as possible when you have a large number of columns per table. Try using fixed-length columns such as CHAR. This trades disk storage for performance.
For instance, the columns 'name', 'gender', 'age', 'bio' in a 'person' table with as many as 100 or even more columns are best defined as follows to maximize performance:
name - CHAR(70)
gender - TINYINT(1)
age - TINYINT(2)
bio - TEXT
The idea is to define columns as small as possible and with fixed length where reasonably possible. Dynamic columns should go at the end of the table structure, so that the fixed-length columns ALL come before them.
It goes without saying that this wastes a tremendous amount of disk storage with a large number of rows, but since you want performance, I guess that is the cost.
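Put together, the advice above might look like this (a sketch of the hypothetical person table):

```sql
CREATE TABLE person (
    person_id INT PRIMARY KEY,
    gender    TINYINT(1),
    age       TINYINT,
    name      CHAR(70),  -- fixed length, unlike VARCHAR(70)
    -- dynamic-length columns go last, after all fixed-length ones
    bio       TEXT
);
```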
Another tip: as you go along, you will find that some columns are much more frequently used (selected or updated) than others. You should separate them into another table, forming a one-to-one relationship with the table that contains the infrequently used columns, and run your queries with fewer columns involved.
Should be fine, unless you have select * from yourHugeTable all over the place. Always only select the columns you need.
30 columns would not normally be considered an excessive number.
Three thousand columns, on the other hand...
How would you implement a very wide "table"?
Beyond performance, database normalization is a necessity for databases with many tables and relations. Normalization gives you easy access to your models and flexible relations for executing different SQL queries.
As shown here, there are eight forms of normalization. But for many systems, applying the first, second and third normal forms is enough.
So, instead of selecting related columns and writing long SQL queries, well-normalized database tables would be better.
Usage-wise, it's appropriate in some situations, for example where tables serve more than one application that share some columns but not others, and where reporting requires a real-time single data pool for all, with no data transitions. If a 200-column table enables that analytic power and flexibility, then I'd say "go long." Of course, in most situations normalization offers efficiency and is best practice, but do what works for your need.