MySQL broad schema vs multiple tables

MySQL broad schema vs multiple tables - mysql

I'm currently modeling the schema for a project which will be using MySQL 5.5. I'm having issues, however with a current feature that's being requested. This feature would require there to be very specific sum of data pertaining to a certain "specialty". There are 47 specialties, and this will rarely change(possibly never). The problem is each of these 47 specialties would require its own table with a different schema.
This would mean my "programs" table will contain 47 different foreign keys, only one of which would bare an actual relation. This seems inefficient to me. However, adding possibly a thousand values to a single table doesn't seem much better, as most of those values won't be provided in any given scenario.
I feel like I'm choosing the better of 2 evils. Though the thought crossed my mind to embed these fields in a TEXT column without a schema, possibly as JSON. These fields don't need to be indexed, but this approach feels dirty also.
I've never dealt with an issue such as this. Is there any alternative to the approaches I mentioned? If not, what approach is best and why?

Related

mysql table with 40+ columns

I have 40+ columns in my table and i have to add few more fields like, current city, hometown, school, work, uni, collage..
These user data wil be pulled for many matching users who are mutual friends (joining friend table with other user friend to see mutual friends) and who are not blocked and also who is not already friend with the user.
The above request is little complex, so i thought it would be good idea to put extra data in same user table to fast access, rather then adding more joins to the table, it will slow the query more down. but i wanted to get your suggestion on this
my friend told me to add the extra fields, which wont be searched on one field as serialized data.
ERD Diagram:
My current table: http://i.stack.imgur.com/KMwxb.png
If i join into more tables: http://i.stack.imgur.com/xhAxE.png
Some Suggestions
nothing wrong with this table and columns
follow this approach MySQL: Optimize table with lots of columns - which serialize extra fields into one field, which are not searchable's
create another table and put most of the data there. (this gets harder on joins, if i already have 3 or more tables to join to pull the records for users (ex. friends, user, check mutual friends)

As usual - it depends.
Firstly, there is a maximum number of columns MySQL can support, and you don't really want to get there.
Secondly, there is a performance impact when inserting or updating if you have lots of columns with an index (though I'm not sure if this matters on modern hardware).
Thirdly, large tables are often a dumping ground for all data that seems related to the core entity; this rapidly makes the design unclear. For instance, the design you present shows 3 different "status" type fields (status, is_admin, and fb_account_verified) - I suspect there's some business logic that should link those together (an admin must be a verified user, for instance), but your design doesn't support that.
This may or may not be a problem - it's more a conceptual, architecture/design question than a performance/will it work thing. However, in such cases, you may consider creating tables to reflect the related information about the account, even if it doesn't have a x-to-many relationship. So, you might create "user_profile", "user_credentials", "user_fb", "user_activity", all linked by user_id.
This makes it neater, and if you have to add more facebook-related fields, they won't dangle at the end of the table. It won't make your database faster or more scalable, though. The cost of the joins is likely to be negligible.
Whatever you do, option 2 - serializing "rarely used fields" into a single text field - is a terrible idea. You can't validate the data (so dates could be invalid, numbers might be text, not-nulls might be missing), and any use in a "where" clause becomes very slow.
A popular alternative is "Entity/Attribute/Value" or "Key/Value" stores. This solution has some benefits - you can store your data in a relational database even if your schema changes or is unknown at design time. However, they also have drawbacks: it's hard to validate the data at the database level (data type and nullability), it's hard to make meaningful links to other tables using foreign key relationships, and querying the data can become very complicated - imagine finding all records where the status is 1 and the facebook_id is null and the registration date is greater than yesterday.
Given that you appear to know the schema of your data, I'd say "key/value" is not a good choice.

I would advice to run some tests. Try it both ways and benchmark it. Nobody will be able to give you a definitive answer because you have not shared your hardware configuration, sample data, sample queries, how you plan on using the data etc. Here is some information that you may want to consider.
Use The Database as it was intended
A relational database is designed specifically to handle data. Use it as such. When written correctly, joining data in a well written schema will perform well. You can use EXPLAIN to optimize queries. You can log SLOW queries and improve their performance. Databases have been around for years, if putting everything into a single table improved performance, don't you think that would be all the buzz on the internet and everyone would be doing it?
Engine Types
How will inserts be affected as the row count grows? Are you using MyISAM or InnoDB? You will most likely want to use InnoDB so you get row level locking and not table. Make sure you are using the correct Engine type for your tables. Get the information you need to understand the pros and cons of both. The wrong engine type can kill performance.
Enhancing Performance using Partitions
Find ways to enhance performance. For example, as your datasets grow you could partition the data. Data partitioning will improve the performance of a large dataset by keeping slices of the data in separate partions allowing you to run queries on parts of large datasets instead of all of the information.
Use correct column types
Consider using UUID Primary Keys for portability and future growth. If you use proper column types, it will improve performance of your data.
Do not serialize data
Using serialized data is the worse way to go. When you use serialized fields, you are basically using the database as a file management system. It will save and retrieve the "file", but then your code will be responsible for unserializing, searching, sorting, etc. I just spent a year trying to unravel a mess like that. It's not what a database was intended to be used for. Anyone advising you to do that is not only giving you bad advice, they do not know what they are doing. There are very few circumstances where you would use serialized data in a database.
Conclusion
In the end, you have to make the final decision. Just make sure you are well informed and educated on the pros and cons of how you store data. The last piece of advice I would give is to find out what heavy users of mysql are doing. Do you think they store data in a single table? Or do they build a relational model and use it the way it was designed to be used?
When you say "I am going to put everything into a single table", you are saying that you know more about performance and can make better choices for optimization in your code than the team of developers that constantly work on MySQL to make it what it is today. Consider weighing your knowledge against the cumulative knowledge of the MySQL team and the DBAs, companies, and members of the database community who use it every day.

At a certain point you should look at the "short row model", also know as entity-key-value stores,as well as the traditional "long row model".
If you look at the schema used by WordPress you will see that there is a table wp_posts with 23 columns and a related table wp_post_meta with 4 columns (meta_id, post_id, meta_key, meta_value). The meta table is a "short row model" table that allows WordPress to have an infinite collection of attributes for a post.
Neither the "long row model" or the "short row model" is the best model, often the best choice is a combination of the two. As #nevillek pointed out searching and validating "short row" is not easy, fetching data can involve pivoting which is annoyingly difficult in MySql and Oracle.
The "long row model" is easier to validate, relate and fetch, but it can be very inflexible and inefficient when the data is sparse. Some rows may have only a few of the values non-null. Also you can't add new columns without modifying the schema, which could force a system outage, depending on your architecture.
I recently worked on a financial services system that had over 700 possible facts for each instrument, most had less than 20 facts. This could have been built by setting up dozens of tables, each for a particular asset class, or as a table with 700 columns, but we chose to use a combination of a table with about 20 columns containing the most popular facts and a 4 column table which contained the other facts. This design was efficient but was difficult ot access, so we built a few table functions in PL/SQL to assist with this.

I have a general comment for you,
Think about it: If you put anything more than 10-12 columns in a table even if it makes sense to put them in a table, I guess you are going to pay the price in the short term, long term and medium term.
Your 3 tables approach seems to be better than the 1 table approach, but consider making those into 5-6 tables rather than 3 tables because you still can.
Move currently, currently_position, currently_link from user-table and work from user-profile into a new table with your primary key called USERWORKPROFILE.
Move locale Information from user-profile to a newer USERPROFILELOCALE information because it is generic in nature.
And yes, all your generic attributes in all the tables should be int and not varchar.
For instance, City needs to move out to a new table called LIST_OF_CITIES with cityid.
And your attribute city should change from varchar to int and point to cityid in LIST_OF_CITIES.
Do not worry about performance issues; the more tables you have, better the performance, because you are actually handing out the performance to the database provider instead of taking it all in your own hands.

SQL one-to-one relationships vs flattening

I'm using a standard SQL database and I'm trying to figure out whether or not to flatten a table or make it more "object-oriented". To me, smaller tables are easier to read but it would require joining tables and having one-to-one relationships. Is this generally a good way of doing things or is it frowned on in the SQL world?
I have a table which has the following attributes:
MYTABLE
- ID
- NAME
- LABEL
- CREATED_TS
- MODIFIED_TS
- CREATED_USER
- MODIFIED_USER
To me, the created/modified fields would be their own object. There are actually a few more fields as well so it's not really just this small. I would think that creating another table called "MYTABLE_MODINFO" or something like that which would have the CREATED and MODIFIED fields and they would be joined when data from them was needed. These tables aren't high access tables, they wouldn't have tons of queries per minute or even hundreds of rows in them, so I don't think efficiency would be much of an issue.
So mainly what I'm wondering is would this be a generally accepted design or should you generally keep your table structures flat?

You should create audit information in the same table. The reason is that this data is part of the row and is a one to one relationship, so there is no point in branching it apart.
If you want to store the audit info (audit tracking/history), then you can create another table, however in most cases I have seen this built by "duplicating" data and creating a surrogate key and mappings back to the original row. The reason I list duplicating in quotes is because auditing inherently requires duplication of the old data...if it is linked and changeable after being written, then it is not really an audit.
Just my two cents. If it does not make sense, then I can provide some examples. But, the gist is that each row will only ever have one current piece of modification information, so why break it out if it will never have more than one?

avoid a database 'one to one', you'll lose performance, scalability, independence. can you imagine what happen if you want to store 2 pictures per ID? will you create another field or will you repeat the row??... it's easier to create relationship to have more freedom when you want to upgrade, please review this tutorials.
http://www.youtube.com/watch?v=Onzm-PxSjtE
http://folkworm.ceri.memphis.edu/ew/SCHEMA_DOC/comparison/erd.htm
http://www.visual-paradigm.com/product/vpuml/provides/dbmodeling.jsp
Beside that you should normalize the DB to be sure that everything is in the best shape possible. Remember that the most important is to take what you need and adapt it.
http://databases.about.com/od/specificproducts/a/normalization.htm
http://www.youtube.com/watch?v=xzeuBwHkKxw

RDBMS design aren't the same with object-oriented approach in my view. the example you mentioned aren't different objects domain but data inheritance of your record. Since there would not be any overhead of tons of queries/execution of the table so you should keep them in the same table for auditing purpose and also easier to work with at normalize data.

Database Design For Tournament Management Software

I'm currently designing a web application using php, javascript, and MySQL. I'm considering two options for the databases.
Having a master table for all the tournaments, with basic information stored there along with a tournament id. Then I would create divisions, brackets, matches, etc. tables with the tournament id appended to each table name. Then when accessing that tournament, I would simply do something like "SELECT * FROM BRACKETS_[insert tournamentID here]".
My other option is to just have generic brackets, divisions, matches, etc. tables with each record being linked to the appropriate tournament, (or matches to brackets, brackets to divisions etc.) by a foreign key in the appropriate column.
My concern with the first approach is that it's a bit too on the fly for me, and seems like the database could get messy very quickly. My concern with the second approach is performance. This program will hopefully have a national if not international reach, and I'm concerned with so many records in a single table, and with so many people possibly hitting it at the same time, it could cause problems.
I'm not a complete newb when it comes to database management; however, this is the first one I've done completely solo, so any and all help is appreciated. Thanks!

Do not create tables for each tournament. A table is a type of an entity, not an instance of an entity. Maintainability and scalability would be horrible if you mix up those concepts. You even say so yourself:
This program will hopefully have a national if not international reach, and I'm concerned with so many records in a single table, and with so many people possibly hitting it at the same time, it could cause problems.
How on Earth would you scale to that level if you need to create a whole table for each record?
Regarding the performance of your second approach, why are you concerned? Do you have specific metrics to back up those concerns? Relational databases tend to be very good at querying relational data. So keep your data relational. Don't try to be creative and undermine the design of the database technology you're using.
You've named a few types of entities:
Tournament
Division
Bracket
Match
Competitor
etc.
These sound like tables to me. Manage your indexes based on how you query the data (that is, don't over-index or you'll pay for it with inserts/updates/deletes). Normalize the data appropriately, de-normalize where audits and reporting are more prevalent, etc. If you're worried about performance then keep an eye on the query execution paths for the ways in which you access the data. Slight tweaks can make a big difference.
Don't pre-maturely optimize. It adds complexity without any actual reason.

First, find the entities that you will need to store; things like tournament, event, team, competitor, prize etc. Each of these entities will probably be tables.
It is standard practice to have a primary key for each of them. Sometimes there are columns (or group of columns) that uniquely identify a row, so you can use that as primary key. However, usually it's best just to have a column named ID or something similar of numeric type. It will be faster and easier for the RDBMS to create and use indexes for such columns.
Store the data where it belongs: I expect to see the date and time of an event in the events table, not in the prizes table.
Another crucial point is conforming to the First normal form, since that assures data atomicity. This is important because it will save you a lot of headache later on. By doing this correctly, you will also have the correct number of tables.
Last but not least: add relevant indexes to the columns that appear most often in queries. This will help a lot with performance. Don't worry about tables having too many rows, RDBMS-es these days handle table with hundreds of millions of rows, they're designed to be able to do that efficiently.

Beside compromising the quality and maintainability of your code (as others have pointed out), it's questionable whether you'd actually gain any performance either.
When you execute...
SELECT * FROM BRACKETS_XXX
...the DBMS needs to find the table whose name matches "BRACKETS_XXX" and that search is done in the DBMS'es data dictionary which itself is a bunch of tables. So, you are replacing a search within your tables with a search within data dictionary tables. You pay the price of the search either way.
(The dictionary tables may or may not be "real" tables, and may or may not have similar performance characteristics as real tables, but I bet these performance characteristics are unlikely to be better than "normal" tables for large numbers of rows. Also, performance of data dictionary is unlikely to be documented and you really shouldn't rely on undocumented features.)
Also, the DBMS would suddenly need to prepare many more SQL statements (since they are now different statements, referring to separate tables), which would present the additional pressure on performance.

The idea of creating new tables whenever a new instance of an item appears is really bad, sorry.
A (surely incomplete) list of why this is a bad idea:
Your code will need to automatically add tables whenever a new Division or whatever is created. This is definitely a bad practice and should be limited to extremely niche cases - which yours definitely isn't.
In case you decide to add or revise a table structure later (e.g. adding a new field) you will have to add it to hundreds of tables which will be cumbersome, error prone and a big maintenance headache
A RDBMS is built to scale in terms of rows, not tables and associated (indexes, triggers, constraints) elements - so you are working against your tool and not with it.
THIS ONE SHOULD BE THE REAL CLINCHER - how do you plan to handle requests like "list all matches which were played on a Sunday" or "find the most recent three brackets where Frank Perry was active"?
You say:
I'm not a complete newb when it comes to database management; however, this is the first one I've done completely solo...
Can you remember another project where tables were cloned whenever a new set was required? If yes, didn't you notice some problems with that approach? If not, have you considered that this is precisely what a DBA would never ever do for any reason whatsoever?

How many database table columns are too many?

I've taken over development on a project that has a user table with over 30 columns. And the bad thing is that changes and additions to the columns keep happening.
This isn't right.
Should I push to have the extra fields moved into a second table as values and create a third table that stores those column names?
user
id
email
user_field
id
name
user_value
id
user_field_id
user_id
value

Do not go the key / value route. SQL isn't designed to handle it and it'll make getting actual data out of your database an exercise in self torture. (Examples: Indexes don't work well. Joins are lots of fun when you have to join just to get the data you're joining on. It goes on.)
As long as the data is normalized to a decent level you don't have too many columns.
EDIT: To be clear, there are some problems that can only be solved with the key / value route. "Too many columns" isn't one of them.

It's hard to say how many is too many. It's really very subjective. I think the question you should be asking is not, "Are there too many columns?", but, rather, "Do these columns belong here?" What I mean by that is if there are columns in your User table that aren't necessarily properties of the user, then they may not belong. For example, if you've got a bunch of columns that sum up the user's address, then maybe you pull those out into an Address table with an FK into User.
I would avoid using key/value tables if possible. It may seem like an easy way to make things extensible, but it's really just a pain in the long run. If you find that your schema is changing very consistently you may want to consider putting some kind of change control in place to vet changes to only those that are necessary, or move to another technology that better supports schema-less storage like NoSQL with MongoDB or CouchDB.

This is often known as EAV, and whether this is right for your database depends on a lot of factors:
http://en.wikipedia.org/wiki/Entity-attribute-value_model
http://karwin.blogspot.com/2009/05/eav-fail.html
http://www.slideshare.net/billkarwin/sql-antipatterns-strike-back
Too many columns is not really one of them.

Changes and additions to a table are not a bad thing if it means they accurately reflect changes in your business requirements.

If the changes and additons are continual then perhaps you need to sit down and do a better job of defining the requirements. Now I can't say if 30 columns is toomany becasue it depends on how wide they are and whether thay are something that shouldbe moved to a related table. For instnce if you have fields like phone1, phone2, phone 3, youo have a mess that needs to be split out into a related table for user_phone. Or if all your columns are wide (and your overall table width is wider than the pages the databases stores data in) and some are not that frequently needed for your queries, they might be better in a related table that has a one-to-one relationship. I would probably not do this unless you have an actual performance problem though.
However, of all the possible choices, the EAV model you described is the worst one both from a maintainabilty and performance viewpoint. It is very hard to write decent queries against this model.

This really depends on what you're trying to do.

MySQL / Rails Performance: One table, many rows vs. many tables, less rows?

In my Rails App I've several models dealing with assets (attachments, pictures, logos etc.). I'm using attachment_fu and so far I have 3 different tables for storing the information in my MySQL DB.
I'm wondering if it makes a difference in the performance if I used STI and put all the information in just 1 table, using a type column and having different, inherited classes. It would be more DRY and easier to maintain, because all share many attributes and characteristics.
But what's faster? Many tables and less rows per table or just one table with many rows? Or is there no difference at all? I'll have to deal with a lot of information and many queries per second.
Thanks for your opinion!

Many tables and fewer rows is probably faster.
That's not why you should do it, though: your database ought to model your Problem Domain. One table is a poor model of many entity types. So you'll end up writing lots and lots of code to find the subset of that table that represents the entity type you're currently concerned with.
Regular, accepted, clean database and front-end client code won't work, because of your one-table-that-is-all-things-and-no-thing-at-all.
It's slower, more fragile, will multiply your code all over you app, and makes a poor model.
Do this only if all the things have exactly the same attributes and the same (or possibly Liskov substitutable) semantic meaning in your problem domain.
Otherwise, just don't even try to do this.
Or if you do, ask why this is any better than having one big Map/hash table/associative array to hold all entities in your app (and lots of functions, most of them duplicared, cut and paste, and out of date, doing switch cases or RTTI to figure out the real type of each entity).

The only way to know for sure is to try both approaches and measure the performance.
In general terms, it depends if you're doing joins across those tables and if you are, how the tables are indexed. Generally speaking, database joins are expensive which is why database schemas are sometimes denormalized to improve performance. This doesn't usually happen until you're dealing with a serious amount of data though i.e. millions of records. You probably don't have that problem yet and maybe never will.

If rows have same attributes then, yes, one table is very better, and just one row to specify type of data, otherwise, use differents tables to deal with, that better in performance, code amount and even in the lisibility of code aswell.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008