How many columns in a table to keep? - MySQL

I am stuck between a row-based and a column-based table design for storing some items. The decision comes down to which table is easier to manage and, if columns, how many columns are best to have. For example, I have object metadata: ideally there are 45 pieces of information (after normalization) at the same level that I need to store per object. So are 45 columns in a heavy read/write table good? Can it work flawlessly in a real-world situation of heavy concurrent reads/writes?

If all or most of your columns are filled with data and this number is fixed, then just use 45 fields. There is nothing inherently bad about 45 columns.
If all of the following conditions are met:
The attributes are neither known nor predictable at design time
The attributes are only occasionally filled (say, 10 or fewer per entity)
There are many possible attributes (hundreds or more)
No attribute is filled for most entities
then you have a so-called sparse matrix. This (and only this) model can be better represented with an EAV table.
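For illustration, a minimal sketch of the two shapes being compared; all table and column names here are hypothetical:

    -- Wide table: one column per attribute (fine when most are filled)
    CREATE TABLE object_meta (
        object_id INT UNSIGNED NOT NULL PRIMARY KEY,
        attr_01   VARCHAR(255),
        attr_02   VARCHAR(255),
        -- ... and so on up to attr_45
        attr_45   VARCHAR(255)
    ) ENGINE=InnoDB;

    -- EAV: one row per (object, attribute) pair (only for the sparse case above)
    CREATE TABLE object_attribute (
        object_id  INT UNSIGNED NOT NULL,
        attr_name  VARCHAR(64)  NOT NULL,
        attr_value VARCHAR(255),
        PRIMARY KEY (object_id, attr_name)
    ) ENGINE=InnoDB;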

"There is a hard limit of 4096 columns per table", it should be just fine.

Taking the "easier to manage" part of the question:
If the property names you are collecting do not change, then using columns is just fine. Even if the table is sparsely populated, disk space is cheap.
However, if you have up to 45 properties per item (row) but those properties might be radically different from one item to another, then using rows is better.
For example, take a product catalog. One product might have color, weight, and height. Another might have a number of buttons or handles. These are obviously radically different properties. Further, this type of data suggests that new properties will be added that might only apply to a particular set of products. In this case, rows are much better.
Another option is to go NoSql and utilize a document based database server. This would allow you to set the named "columns" on a per item basis.
All of that said, management of rows will be done by the application. This will require some advanced DB skills. Management of columns will be done by the developer at design time; which is usually easier for most people to get their minds around.

I don't know if I'm correct, but I once read that in MySQL you should keep your tables to the minimum number of columns IF POSSIBLE (read: http://dev.mysql.com/doc/refman/5.0/en/data-size.html ). Do NOTE: this applies to MySQL; I don't know whether the same concept applies to other DBMSs like Oracle, Firebird, PostgreSQL, etc.
You could take a look at your table with 45 columns, analyze what you truly need, and move the optional fields into another table.
Hope it helps, good luck

Related

MySQL design regarding a web

I am tackling a problem in class to design a MySQL representation of a web that stores a list of events associated with a person. So, for this table/tables, it would have 2 columns, one of which is the person's name and the other is the event. However, a person will generally have anywhere from 30-1000 events, so this table, which we plan to have for our entire undergraduate class of 6000 students, will have millions of entries. Is there a better way to store this in MySQL that will take less space, but will still be able to retrieve individual events and the list of people that attended them just as easily as if it were a table of two columns?
Yes, there is a technique called many-to-many. It essentially breaks your one table into three, which makes sense when you consider that there are indeed exactly three entities being modeled (a good sanity check):
Person
Event
A Person's association with an Event
You model this as three tables, with the first two having essentially two columns each: one with a unique index (called the "primary key"), and the second being a semantic name (person name, event name). Note that you can also add any number of columns to these tables, and each new value is stored only once per person or event (most likely your first move will be to add a date column to the event table).
The third table is the interesting one, it contains only 2 columns, each numeric, both of which are references to the other tables (each row is simply: (person_id, event_id)). We term these "foreign keys".
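A minimal sketch of the three tables just described, with illustrative (not prescriptive) names and types:

    CREATE TABLE person (
        person_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        name      VARCHAR(255) NOT NULL
    ) ENGINE=InnoDB;

    CREATE TABLE event (
        event_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        name       VARCHAR(255) NOT NULL,
        event_date DATE                        -- a likely first addition
    ) ENGINE=InnoDB;

    CREATE TABLE attendance (
        person_id INT UNSIGNED NOT NULL,
        event_id  INT UNSIGNED NOT NULL,
        PRIMARY KEY (person_id, event_id),
        KEY idx_event (event_id),              -- for "who attended event X" lookups
        FOREIGN KEY (person_id) REFERENCES person (person_id),
        FOREIGN KEY (event_id)  REFERENCES event (event_id)
    ) ENGINE=InnoDB;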
This structure means a few things:
No matter how many events someone goes to, that person is only represented once.
The same goes for events, no matter how many attendees.
The attendance is a "first-class" entity, and can grow to include its own attributes (e.g. "role").
This structure is called many-to-many because each person may attend many events, and each event may have many attendees.
The quintessential feature of the design is that no single piece of domain knowledge is repeated; only "keys" are repeated as necessary to model the real-world domain. (For example, in your original two-column design, accounting for a name change would require an unknown number of updates and might lead to data anomalies, the avoidance of which is a primary concern of database normalization.)
Don't worry about "space". This isn't the 1970s and we're not going to run out of columns on punch cards to store data. You should be concerned with expressing your requirements in the proper, most normalized data structure. With proper indexing there shouldn't be a problem, not with this volume of data.
Remember indexes need to be defined on anything you will include as part of a WHERE clause, and sometimes you may need to add additional indexes for large lists fetched with ORDER BY and LIMIT.
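As a small illustration, reusing the hypothetical event table from the sketch above, a query that filters and orders on the same column can be served by a single index:

    -- Hypothetical query: recent events, newest first
    SELECT event_id, name
    FROM event
    WHERE event_date >= '2023-01-01'
    ORDER BY event_date DESC
    LIMIT 20;

    -- One index serves both the WHERE range and the ORDER BY ... LIMIT
    CREATE INDEX idx_event_date ON event (event_date);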
Whenever possible or practical use an integer identifier instead of a string. These are stored as a small number of bytes, typically 4, compared with a variable length string which is typically at least the length of the string in bytes plus 1.
A properly normalized database will use numerical identifiers for things anyway, so this kind of thing isn't a huge concern. The only time you go against this, or deliberately de-normalize your data, is when you have a legitimate performance problem that cannot be easily solved using some other method.
As always, test your schema by generating large amounts of dummy data and see how it performs. Since you have a good idea of the requirements in advance, do some testing at those levels, and then, to be on the safe side, try 2x, 5x and 10x the data to see how much flexibility your design has. It's okay to have performance limitations so long as you know at what kind of scale you'll experience them.
MySQL and relational databases in general were designed specifically to handle this sort of problem. Handling millions of entries is not a problem. Complex queries may take a couple of seconds but will still perform remarkably well.
Storing one event per row is the best design. The way you are going about it sounds like the best way. Good luck.

More efficient to have two tables or one table with tons of fields

Related but not quite the same thing: which is more efficient? (or at least reading through it didn't help me any)
So I am working on a new site (selling insurance policies). We already have several sites up (it's a Rails application) that do this, so I have a table in my SQL database called policies.
As you can imagine it has lots of columns to support all the different options available.
While working on this new site I realized I needed to keep track of 20+ more options.
My concern is that the policies table is already large, but the columns in it right now are almost all used by every application we have. Whereas if I add these they would only be used for the new site and would leave tons of null cells on all the rest of the policies.
So my question is do I add those to the existing table or create a new table just for the policies sold on that site? Also I believe that if I created a new table I could leave out some of the columns (but not very many) from the main policies table because they are not needed for this application.
"[A]lmost all used" suggests that you could, upon considering it, split it more naturally.
Now, much of the efficiency concern here goes down to three things:
A single table can be scanned through more quickly than joins across several.
Large rows have a memory and disk-space cost in themselves.
If a single table represents something that is really a 1-to-many, then it requires more work on insert, delete or update.
Point 2 only really comes in, should there be a lot of cases where you need one particular subset of the data, and another batch where you need another subset, and maybe just a few where you need them all. If you're using most of the columns in most places, then it doesn't gain you anything. In that case, splitting tables is bad.
Point 1 and 3 argue for and against joining into one big table, respectively.
Before any of that though, let's get back to "almost all". If there are several rows with a batch of null fields, why? Often answering that "why?" reveals that really there's a natural split there, which should be broken off into another table as part of normal normalisation*. Repetition of fields is an even greater suggestion that this is the case.
Do this first.
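As a rough sketch of what such a split might look like for the policies example (all table and column names are invented for illustration):

    -- Columns used by every application stay in the main table
    CREATE TABLE policies (
        policy_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        customer_id INT UNSIGNED NOT NULL,
        premium     DECIMAL(10,2) NOT NULL
        -- ... plus the other columns shared by all sites
    ) ENGINE=InnoDB;

    -- Options used only by the new site go into a 1:1 side table,
    -- so the old rows carry no extra NULL columns
    CREATE TABLE policy_site_options (
        policy_id INT UNSIGNED NOT NULL PRIMARY KEY,
        option_a  VARCHAR(255),
        option_b  VARCHAR(255),
        -- ... the ~20 new options
        FOREIGN KEY (policy_id) REFERENCES policies (policy_id)
    ) ENGINE=InnoDB;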
To denormalise - whether by splitting what is naturally one table, or joining what is naturally several - is a very particular type of optimisation - it makes some things more efficient at the cost of making other things less efficient, and it introduces possibilities of bugs that don't exist otherwise. I would never say you should never denormalise - I do it myself - but you need to be able to say "I am denormalising table X & Y in this manner, because it will help case C which happens enough and I can live with the extra cost to case D". Then you need to check it actually did help case C significantly and case D insignificantly, along with looking for hidden costs.
One of the reasons for normalising in the first place is it gives good average performance over a wide range of cases. It's the balance you want most of the time. Denormalising from the get-go rather than with a normalised database as a starting point is almost always premature.
*Fun trivia fact: The name "normalization" was in part a take on Richard Nixon's "Vietnamisation" policy, meaning there was a running joke in some quarters of adding "-isation" onto just about anything. Were it not for the White House's reaction to the Tet Offensive, we could be using the gerund "normalising", or something completely different instead.

Best database design for storing a high number columns?

Situation: We are working on a project that reads datafeeds into the database at our company. These datafeeds can contain a high number of fields. We match those fields with certain columns.
At this moment we have about 120 types of fields. Those all need a column. We need to be able to filter and sort on all columns.
The problem is that I'm unsure what database design would be best for this. I'm using MySQL for the job but I'm open to suggestions. At this moment I'm planning to make a table with all 120 columns since that is the most natural way to do things.
Options: My other options are a meta table that stores key and values. Or using a document based database so I have access to a variable schema and scale it when needed.
Question:
What is the best way to store all this data? The row count could go up to 100k rows and I need a storage that can select, sort and filter really fast.
Update:
Some more information about usage. XML feeds will be generated live from this table. We are talking about 100-500 requests per hour, but this will be growing. The fields will not change regularly, but it could happen once every 6 months. We will also be updating the datafeeds daily: checking whether items are updated, deleting old ones and adding new ones.
120 columns at 100k rows is not enough information, that only really gives one of the metrics: size. The other is transactions. How many transactions per second are you talking about here?
Is it a nightly update with a manager running a report once a week, or a million page-requests an hour?
I don't generally need to start looking at 'clever' solutions until hitting a 10m record table, or hundreds of queries per second.
Oh, and do not use a key-value pair table. They are not great in a relational database, so stick to properly typed fields.
I personally would recommend sticking to a conventional one-column-per-field approach and only deviate from this if testing shows it really isn't right.
With regards to retrieval, if the INSERTS/UPDATES are only happening daily, then I think some careful indexing on the server side, and good caching wherever the XML is generated, should reduce the server hit a good amount.
For example, you say 'we will be updating the datafeeds daily', so there shouldn't be any need to query the database every time. And even so, 1000 per hour is only 17 per minute. That probably rounds down to nothing.
I'm working on a similar project right now, downloading dumps from the net and loading them into the database, merging changes into the main table and properly adjusting the dictionary tables.
First, you know the data you'll be working with. So it is necessary to analyze it in advance and pick the best table/column layout. If you have all your 120 columns containing textual data, then a single row will take several K-bytes of disk space. In such situation you will want to make all queries highly selective, so that indexes are used to minimize IO. Full scans might take significant time with such a design. You've said nothing about how big your 500/h requests will be, will each request extract a single row, a small bunch of rows or a big portion (up to whole table)?
Second, looking at the data, you might outline a number of columns that will have a limited set of values. I prefer to do the following transformation for such columns:
set up a dictionary table, giving it an integer PK;
replace the actual value in the master table's column with the PK from the dictionary.
The transformation is done by triggers written in C, so although it gives me an upload penalty, I do get some benefits:
decreased total size of the database and master table;
better options for the database and OS to cache frequently accessed data blocks;
better query performance.
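As a plain-SQL illustration of that dictionary transformation (the author implements it with triggers; the tables and columns below, such as feed_master and color, are invented for the example):

    -- Dictionary of distinct values for one low-cardinality column
    CREATE TABLE color_dict (
        color_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        color    VARCHAR(64) NOT NULL UNIQUE
    ) ENGINE=InnoDB;

    -- Populate it from the existing data
    INSERT INTO color_dict (color)
    SELECT DISTINCT color FROM feed_master WHERE color IS NOT NULL;

    -- Replace the text column in the master table with the dictionary key
    ALTER TABLE feed_master ADD COLUMN color_id INT UNSIGNED NULL;
    UPDATE feed_master fm
    JOIN color_dict d ON d.color = fm.color
    SET fm.color_id = d.color_id;
    ALTER TABLE feed_master DROP COLUMN color;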
Third, try to split the data according to the extracts you'll be doing. Quite often it turns out that only 30-40% of the fields in the table are used by most queries, while the remaining 60-70% are spread evenly across queries and only partially used. In this case I would recommend splitting the main table accordingly: extract the fields that are always used into a single "master" table, and create another table for the rest of the fields. In fact, you can have several "other" tables, logically grouping data in separate tables.
In my practice we had a table that contained detailed customer information: name details, address details, status details, banking details, billing details, financial details and a set of custom comments. All queries on such a table were expensive, as it was used in the majority of our reports (reports typically perform full scans). By splitting this table into a set of smaller ones and building a view with rules on top of them (to keep the external application happy), we managed to gain a pleasant performance boost (sorry, I don't have the numbers any longer).
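A sketch of that view-over-split-tables idea, with invented names:

    -- After splitting, a view re-assembles the original shape for legacy code
    CREATE VIEW customer AS
    SELECT c.customer_id, c.name, c.status,
           b.iban, b.bank_name,
           f.credit_limit, f.billing_cycle
    FROM customer_core c
    LEFT JOIN customer_banking b ON b.customer_id = c.customer_id
    LEFT JOIN customer_finance f ON f.customer_id = c.customer_id;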
To summarize: you know the data you'll be working with and you know the queries that will be used to access your database, analyze and design accordingly.

How to decide between row based and column based table structures?

I have a data set which has hundreds of parameters (with more coming in).
If I dump them in one table, it'll probably end up having hundreds of columns (and I am not even sure how many, at this point).
I could go row-based, with a bunch of meta tables, but somehow a row-based structure feels unintuitive.
One more way would be to keep it column-based, but have multiple tables (split the tables logically), which seems like a good solution.
Is there any other way to do it? If yes, could you point me to some tutorial? (I am using MySQL)
EDIT:
based on the answers, I should clarify one thing - updates and deletes are going to be much less frequent than inserts and selects. As it is, selects are going to be the bulk of the operations, so selects have to be fast.
I ran across several designs where a fourth option was possible:
Split your columns into searchable and auxiliary
Define a table with only searchable columns, and an extra BLOB column
Put everything in one table: searchable columns go as-is, auxiliary go as a BLOB
We used this approach with BLOBs of XML data or even binary data representing the entire serialized object. The downside is that your auxiliary columns remain non-searchable for all practical purposes. The upside is that you can add new auxiliary columns at will without changing the schema. You can also make a previously auxiliary column searchable later with a schema change and a very simple program.
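A minimal sketch of that layout; the searchable columns shown are assumptions for illustration:

    CREATE TABLE items (
        item_id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        -- searchable columns get real, indexed fields
        category   VARCHAR(64)  NOT NULL,
        price      DECIMAL(10,2),
        created_at DATETIME NOT NULL,
        -- everything else rides along as an opaque serialized blob
        aux_data   BLOB,
        KEY idx_category_price (category, price)
    ) ENGINE=InnoDB;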
It all depends on the kind of data you need to store.
If it's not "relational" at all - for instance, a collection of web pages, documents, etc - it's usually not a good fit for a relational database.
If it's relational, but highly variable in schema - e.g. a product catalogue - you have a number of options:
single table with every possible column (your option 1)
"common" table with the attributes that each type shares, and joined tables for attributes for subtypes
table per subtype
If the data is highly variable and you don't want to make schema changes to accommodate the variations, you can use "entity-attribute-value" or EAV - though this has some significant drawbacks in the context of a relational database. I think this is what you have in mind with option 2.
If the data is indeed relational, and there is at least the core of a stable model in the data, you could of course use traditional database design techniques to come up with a schema. That seems to correspond with option 3.
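A sketch of the "common table plus subtype tables" option, using an invented product catalogue as an example:

    CREATE TABLE product (
        product_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        name         VARCHAR(255) NOT NULL,
        product_type ENUM('book', 'shirt') NOT NULL
    ) ENGINE=InnoDB;

    CREATE TABLE product_book (
        product_id INT UNSIGNED NOT NULL PRIMARY KEY,
        isbn       VARCHAR(20),
        page_count INT,
        FOREIGN KEY (product_id) REFERENCES product (product_id)
    ) ENGINE=InnoDB;

    CREATE TABLE product_shirt (
        product_id INT UNSIGNED NOT NULL PRIMARY KEY,
        color      VARCHAR(32),
        size       VARCHAR(8),
        FOREIGN KEY (product_id) REFERENCES product (product_id)
    ) ENGINE=InnoDB;

The common table holds what every product shares; each subtype table holds only the attributes that apply to it, joined 1:1 on product_id.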
Does every item in the data set have all those properties? If yes, then one big table might well be fine (although scary-looking).
On the other hand, perhaps you can group the properties. The idea being that if an item has one of the properties in the group, then it has all the properties in that group. If you can create such groupings, then these could be separate tables.
So should they be separate? Yes, unless you can prove that the cost of performing joins is unacceptable. Perform all SELECTs via stored procedures and you can denormalise later on without much trouble.

Which is more efficient: Multiple MySQL tables or one large table?

I store various user details in my MySQL database. Originally it was set up in various tables, meaning data is linked with UserIds and output via sometimes complicated calls to display and manipulate the data as required. Setting up a new system, it almost makes sense to combine all of these tables into one big table of related content.
Is this going to be a help or hindrance?
Speed considerations in calling, updating or searching/manipulating?
Here's an example of some of my table structure(s):
users - UserId, username, email, encrypted password, registration date, ip
user_details - cookie data, name, address, contact details, affiliation, demographic data
user_activity - contributions, last online, last viewing
user_settings - profile display settings
user_interests - advertising targetable variables
user_levels - access rights
user_stats - hits, tallies
Edit: I've upvoted all answers so far, they all have elements that essentially answer my question.
Most of the tables have a 1:1 relationship which was the main reason for denormalising them.
Are there going to be issues if the table spans across 100+ columns when a large portion of these cells are likely to remain empty?
Multiple tables help in the following ways / cases:
(a) if different people are going to be developing applications involving different tables, it makes sense to split them.
(b) If you want to give different kind of authorities to different people for different part of the data collection, it may be more convenient to split them. (Of course, you can look at defining views and giving authorization on them appropriately).
(c) For moving data to different places, especially during development, it may make sense to use tables resulting in smaller file sizes.
(d) A smaller footprint may give comfort while you develop applications on a specific data collection of a single entity.
(e) It is a possibility: what you thought of as single-value data may turn out to be multiple values in future. e.g. credit limit is a single-value field as of now, but tomorrow you may decide to change the values to (date from, date to, credit value). Split tables might come in handy then.
My vote would be for multiple tables - with data appropriately split.
Good luck.
Combining the tables is called denormalizing.
It may (or may not) help to make some queries (which make lots of JOINs) to run faster at the expense of creating a maintenance hell.
MySQL is capable of using only one JOIN method, namely NESTED LOOPS.
This means that for each record in the driving table, MySQL locates a matching record in the driven table in a loop.
Locating a record is quite a costly operation which may take dozens times as long as the pure record scanning.
Moving all your records into one table will help you to get rid of this operation, but the table itself grows larger, and the table scan takes longer.
If you have lots of records in the other tables, then the increase in table scan time can outweigh the benefit of the records being scanned sequentially.
Maintenance hell, on the other hand, is guaranteed.
Are all of them 1:1 relationships? I mean, if a user could belong to, say, different user levels, or if the users interests are represented as several records in the user interests table, then merging those tables would be out of the question immediately.
Regarding previous answers about normalization, it must be said that the database normalization rules completely disregard performance and look only at what is a neat database design. That is often what you want to achieve, but there are times when it makes sense to actively denormalize in pursuit of performance.
All in all, I'd say the question comes down to how many fields there are in the tables, and how often they are accessed. If user activity is often not very interesting, then it might just be a nuisance to always have it on the same record, for performance and maintenance reasons. If some data, like settings, say, are accessed very often, but simply contains too many fields, it might also not be convenient to merge the tables. If you're only interested in the performance gain, you might consider other approaches, such as keeping the settings separate, but saving them in a session variable of their own so that you don't have to query the database for them very often.
Do all of those tables have a 1-to-1 relationship? For example, will each user row only have one corresponding row in user_stats or user_levels? If so, it might make sense to combine them into one table. If the relationship is not 1 to 1 though, it probably wouldn't make sense to combine (denormalize) them.
Having them in separate tables vs. one table is probably going to have little effect on performance though unless you have hundreds of thousands or millions of user records. The only real gain you'll get is from simplifying your queries by combining them.
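For context, keeping the tables separate means a 1:1 lookup is a join like the following (column names beyond those listed in the question are assumed):

    -- Assumes user_details carries a UserId column matching users.UserId
    SELECT u.username, d.name, d.address
    FROM users u
    JOIN user_details d ON d.UserId = u.UserId
    WHERE u.UserId = 42;

    -- The combined alternative would be a single, wider users row,
    -- read with no join at all.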
ETA:
If your concern is about having too many columns, then think about what stuff you typically use together and combine those, leaving the rest in a separate table (or several separate tables if needed).
If you look at the way you use the data, my guess is that you'll find that something like 80% of your queries use 20% of that data with the remaining 80% of the data being used only occasionally. Combine that frequently used 20% into one table, and leave the 80% that you don't often use in separate tables and you'll probably have a good compromise.
Creating one massive table goes against relational database principles. I wouldn't combine them all into one table. You're going to get multiple instances of repeated data. If your user has three interests, for example, you will have 3 rows with the same user data in, just to store the three different interests. Definitely go for the multiple 'normalized' table approach. See this Wiki page for database normalization.
Edit:
I have updated my answer, as you have updated your question... I agree with my initial answer even more now since...
a large portion of these cells are
likely to remain empty
If, for example, a user didn't have any interests, then if you normalize you simply won't have a row in the interests table for that user. If you have everything in one massive table, then you will have columns (and apparently a lot of them) that contain just NULLs.
I have worked for a telephony company where there were tons of tables, and getting data could require many joins. When the performance of reading from these tables was critical, procedures were created that could generate a flat table (i.e. a denormalized table) requiring no joins, calculations etc. that reports could point to. These were then used in conjunction with a SQL Server agent to run the job at certain intervals (i.e. a weekly view of some stats would run once a week and so on).
Why not use the same approach Wordpress does by having a users table with basic user information that everyone has and then adding a "user_meta" table that can basically be any key, value pair associated with the user id. So if you need to find all the meta information for the user you could just add that to your query. You would also not always have to add the extra query if not needed for things like logging in. The benefit to this approach also leaves your table open to adding new features to your users such as storing their twitter handle or each individual interest. You also won't have to deal with a maze of associated ID's because you have one table that rules all metadata and you will limit it to only one association instead of 50.
Wordpress specifically does this to allow for features to be added via plugins, therefore allowing for your project to be more scalable and will not require a complete database overhaul if you need to add a new feature.
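A rough sketch of that layout, loosely modeled on WordPress's users/usermeta pair but with simplified, illustrative names:

    CREATE TABLE users (
        user_id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        username VARCHAR(64)  NOT NULL,
        email    VARCHAR(255) NOT NULL
    ) ENGINE=InnoDB;

    CREATE TABLE user_meta (
        meta_id    BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        user_id    INT UNSIGNED NOT NULL,
        meta_key   VARCHAR(255) NOT NULL,
        meta_value LONGTEXT,
        KEY idx_user_key (user_id, meta_key),
        FOREIGN KEY (user_id) REFERENCES users (user_id)
    ) ENGINE=InnoDB;

    -- e.g. pull one user's metadata only when it is actually needed
    SELECT meta_key, meta_value FROM user_meta WHERE user_id = 42;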
I think this is one of those "it depends" situation. Having multiple tables is cleaner and probably theoretically better. But when you have to join 6-7 tables to get information about a single user, you might start to rethink that approach.
I would say it depends on what the other tables really mean.
Does user_details contain more than one row per user, and so on?
What level of normalization is best suited for your needs depends on your demands.
If you have one table with a good index, that would probably be faster, but on the other hand probably more difficult to maintain.
To me it looks like you could skip User_Details, as it is probably a 1:1 relation with Users.
But the rest probably have a lot of rows per user?
Performance considerations on big tables
"Likes" and "views" (etc) are one of the very few valid cases for 1:1 relationship _for performance. This keeps the very frequent UPDATE ... +1 from interfering with other activity and vice versa.
Bottom line: separate frequent counters in very big and busy tables.
Another possible case is where you have a group of columns that are rarely present. Rather than having a bunch of nulls, have a separate table that is related 1:1, or more aptly phrased "1:rarely". Then use LEFT JOIN only when you need those columns. And use COALESCE() when you need to turn NULL into 0.
Bottom Line: It depends.
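A sketch of both patterns described above, with hypothetical posts/post_counts tables:

    -- Frequent counters live in their own 1:1 table so the hot UPDATEs
    -- don't interfere with the big, busy main table
    CREATE TABLE post_counts (
        post_id INT UNSIGNED NOT NULL PRIMARY KEY,
        likes   INT UNSIGNED NOT NULL DEFAULT 0,
        views   INT UNSIGNED NOT NULL DEFAULT 0
    ) ENGINE=InnoDB;

    UPDATE post_counts SET views = views + 1 WHERE post_id = 42;

    -- LEFT JOIN folds the "1:rarely" columns back in only when needed,
    -- and COALESCE() turns a missing row's NULLs into zeros
    SELECT p.title,
           COALESCE(c.likes, 0) AS likes,
           COALESCE(c.views, 0) AS views
    FROM posts p
    LEFT JOIN post_counts c ON c.post_id = p.post_id
    WHERE p.post_id = 42;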
Limit search conditions to one table. An INDEX cannot reference columns in different tables, so a WHERE clause that filters on multiple columns might use an index on one table, but then has to work harder to continue filtering on columns in other tables. This issue is especially bad if "ranges" are involved.
Bottom line: Don't move such columns into a separate table.
TEXT and BLOB columns can be bulky, and this can cause performance issues, especially if you unnecessarily say SELECT *. Such columns are stored "off-record" (in InnoDB). This means that the extra cost of fetching them may involve extra disk hits.
Bottom line: InnoDB is already taking care of this performance 'problem'.