Considerable slowdown due to 40 boolean columns? - mysql

I'm trying to add musical style information to an event table. From what I've gathered, this can be done either by adding a column for each musical style or by using a many-to-many table relation, which I don't want because I want each event returned only once. Do you think that with that many boolean columns in a row I would face considerable database slowdown? (The data will only be read by the users.)
Thank you :)

The columns will not slow down the database per se, but keep in mind that adding a boolean column for every music style is very poor design. Over time, the possible musical styles in your application are likely to change: maybe new ones must be added, redundant or useless ones must be removed, whatever. With your proposed design, you'd have to modify your database structure to add the new columns to the table. This is usually painful, error-prone, and you'll also have to go over all your queries to make sure they don't break because of the new structure.
You should design your database schema so that it's flexible enough to allow for variance over time in the contents of your application. For example, you could have a master table with one row for every musical style, defining its ID, name, description, etc. Then, a relationships table contains the relationship between an entity (an event, if I understood your question correctly) and a music style from the master table. You enforce consistency by putting foreign keys in place, to ensure the data is always clean (e.g. you cannot reference a music style that is not in the master table). This way, you can modify the music styles without touching anything in the database structure.
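A minimal sketch of that master/relationship layout (table and column names here are illustrative, and it assumes an existing event table with an id primary key):

CREATE TABLE music_style (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL UNIQUE,
    description TEXT
);

-- one row per (event, style) pair; the foreign keys keep the data clean
CREATE TABLE event_music_style (
    event_id INT NOT NULL,
    style_id INT NOT NULL,
    PRIMARY KEY (event_id, style_id),
    FOREIGN KEY (event_id) REFERENCES event(id),
    FOREIGN KEY (style_id) REFERENCES music_style(id)
);

Adding or retiring a musical style is then a one-row INSERT or DELETE in music_style, with no schema change.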
Reading a bit on database normalization will help you a lot; you don't have to go all the way to a fully normalized database, but understanding the principles behind it will allow you to design efficient and clean database structures.

The likely answer is no
Having multiple boolean columns in a row should not significantly slow down DB performance, assuming your indices are set up appropriately.
EDIT: That being said, it may be optimal to use a details table and JOIN to it to get this data... but you said you didn't want to do that.
I assume you want to do something like have an event row with a bunch of columns like "isCountry", "isMetal", "isPunk" and you'll query all events marked as
isPunk = 1 OR isMetal = 1
or something like that.
The weakness of this design is that to add/remove musical styles you need to change your DB schema.
An alternative is a TBLMusicalStyles table with an ID and a Name, plus a TBLEventStyles table which just contains EventID and StyleID.
Then you could join them and just search on the styles table... and adding and removing styles would be relatively simple.
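Something along these lines, assuming an events table named TBLEvents with an ID column (that table name is an assumption, not from the question):

-- every event tagged Punk or Metal, each event returned once
SELECT DISTINCT e.*
FROM TBLEvents e
JOIN TBLEventStyles es ON es.EventID = e.ID
JOIN TBLMusicalStyles ms ON ms.ID = es.StyleID
WHERE ms.Name IN ('Punk', 'Metal');

The DISTINCT takes care of the asker's "each event returned only once" requirement even when an event matches several of the requested styles.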

The performance of requests that do not involve musical styles will not be affected.
If your columns are properly indexed, then requests that involve finding rows matching client-provided musical styles should actually be faster.
However, all other requests that involve musical styles will be significantly slower, and also harder to write. For instance, "get all rows that share at least one style with the current row" would be a much harder query to write and execute.
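For comparison, with a junction-table design that request is a simple self-join (table and column names assumed as in the previous answer):

-- all events sharing at least one style with event 42
SELECT DISTINCT es2.EventID
FROM TBLEventStyles es1
JOIN TBLEventStyles es2 ON es2.StyleID = es1.StyleID
WHERE es1.EventID = 42
  AND es2.EventID <> 42;

With 40 boolean columns, the equivalent query needs 40 OR'ed column comparisons built dynamically from the values of the current row.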

Related

SQL one-to-one relationships vs flattening

I'm using a standard SQL database and I'm trying to figure out whether or not to flatten a table or make it more "object-oriented". To me, smaller tables are easier to read but it would require joining tables and having one-to-one relationships. Is this generally a good way of doing things or is it frowned on in the SQL world?
I have a table which has the following attributes:
MYTABLE
- ID
- NAME
- LABEL
- CREATED_TS
- MODIFIED_TS
- CREATED_USER
- MODIFIED_USER
To me, the created/modified fields would be their own object. There are actually a few more fields as well, so it's not really just this small. I'm thinking of creating another table called "MYTABLE_MODINFO" or something like that, which would hold the CREATED and MODIFIED fields and be joined in when that data is needed. These tables aren't high-access tables; they wouldn't get tons of queries per minute or hold even hundreds of rows, so I don't think efficiency would be much of an issue.
So mainly what I'm wondering is would this be a generally accepted design or should you generally keep your table structures flat?
You should keep the audit information in the same table. The reason is that this data is part of the row and is a one-to-one relationship, so there is no point in splitting it apart.
If you want to store the audit info (audit tracking/history), then you can create another table; however, in most cases I have seen this built by "duplicating" data and creating a surrogate key that maps back to the original row. The reason I put "duplicating" in quotes is that auditing inherently requires duplication of the old data: if it is linked and changeable after being written, then it is not really an audit.
Just my two cents. If it does not make sense, then I can provide some examples. But, the gist is that each row will only ever have one current piece of modification information, so why break it out if it will never have more than one?
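For instance, a sketch of the history-table variant described above (all names illustrative): the audit row gets its own surrogate key plus a mapping back to the original row, and copies the values as they looked at the time of the change.

CREATE TABLE mytable_audit (
    audit_id INT AUTO_INCREMENT PRIMARY KEY,  -- surrogate key for the audit row
    mytable_id INT NOT NULL,                  -- mapping back to the original row
    name VARCHAR(100),                        -- "duplicated" copies of the audited fields
    label VARCHAR(100),
    modified_ts DATETIME,
    modified_user VARCHAR(50)
);

There is deliberately no foreign key back to MYTABLE here, so the history survives even if the original row is deleted.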
Avoid a database 'one to one'; you'll lose performance, scalability, and independence. Can you imagine what happens if you want to store 2 pictures per ID? Will you create another field, or will you repeat the row? It's easier to create a relationship, so you have more freedom when you want to upgrade. Please review these tutorials:
http://www.youtube.com/watch?v=Onzm-PxSjtE
http://folkworm.ceri.memphis.edu/ew/SCHEMA_DOC/comparison/erd.htm
http://www.visual-paradigm.com/product/vpuml/provides/dbmodeling.jsp
Besides that, you should normalize the DB to be sure that everything is in the best shape possible. Remember that the most important thing is to take what you need and adapt it.
http://databases.about.com/od/specificproducts/a/normalization.htm
http://www.youtube.com/watch?v=xzeuBwHkKxw
RDBMS design isn't the same as an object-oriented approach, in my view. The fields you mention aren't a different object domain but rather data inherent to your record. Since there would not be any overhead from tons of queries against the table, you should keep them in the same table; it serves the auditing purpose and is easier to work with in normalized data.

Database Design For Tournament Management Software

I'm currently designing a web application using php, javascript, and MySQL. I'm considering two options for the databases.
Having a master table for all the tournaments, with basic information stored there along with a tournament id. Then I would create divisions, brackets, matches, etc. tables with the tournament id appended to each table name. Then when accessing that tournament, I would simply do something like "SELECT * FROM BRACKETS_[insert tournamentID here]".
My other option is to just have generic brackets, divisions, matches, etc. tables with each record being linked to the appropriate tournament, (or matches to brackets, brackets to divisions etc.) by a foreign key in the appropriate column.
My concern with the first approach is that it's a bit too on the fly for me, and seems like the database could get messy very quickly. My concern with the second approach is performance. This program will hopefully have a national if not international reach, and I'm concerned with so many records in a single table, and with so many people possibly hitting it at the same time, it could cause problems.
I'm not a complete newb when it comes to database management; however, this is the first one I've done completely solo, so any and all help is appreciated. Thanks!
Do not create tables for each tournament. A table is a type of an entity, not an instance of an entity. Maintainability and scalability would be horrible if you mix up those concepts. You even say so yourself:
This program will hopefully have a national if not international reach, and I'm concerned with so many records in a single table, and with so many people possibly hitting it at the same time, it could cause problems.
How on Earth would you scale to that level if you need to create a whole table for each record?
Regarding the performance of your second approach, why are you concerned? Do you have specific metrics to back up those concerns? Relational databases tend to be very good at querying relational data. So keep your data relational. Don't try to be creative and undermine the design of the database technology you're using.
You've named a few types of entities:
Tournament
Division
Bracket
Match
Competitor
etc.
These sound like tables to me. Manage your indexes based on how you query the data (that is, don't over-index or you'll pay for it with inserts/updates/deletes). Normalize the data appropriately, de-normalize where audits and reporting are more prevalent, etc. If you're worried about performance then keep an eye on the query execution paths for the ways in which you access the data. Slight tweaks can make a big difference.
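A minimal sketch of that one-table-per-entity-type layout (column details are assumptions, not from the question):

CREATE TABLE tournament (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(200) NOT NULL
);

CREATE TABLE bracket (
    id INT AUTO_INCREMENT PRIMARY KEY,
    tournament_id INT NOT NULL,
    FOREIGN KEY (tournament_id) REFERENCES tournament(id)
);

-- MATCH is a reserved word in MySQL, hence the backticks
CREATE TABLE `match` (
    id INT AUTO_INCREMENT PRIMARY KEY,
    bracket_id INT NOT NULL,
    played_on DATETIME,
    FOREIGN KEY (bracket_id) REFERENCES bracket(id)
);

One tournament's brackets are then SELECT * FROM bracket WHERE tournament_id = 123; rather than a per-tournament table.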
Don't prematurely optimize. It adds complexity without any actual benefit.
First, find the entities that you will need to store; things like tournament, event, team, competitor, prize etc. Each of these entities will probably be tables.
It is standard practice to have a primary key for each of them. Sometimes there are columns (or group of columns) that uniquely identify a row, so you can use that as primary key. However, usually it's best just to have a column named ID or something similar of numeric type. It will be faster and easier for the RDBMS to create and use indexes for such columns.
Store the data where it belongs: I expect to see the date and time of an event in the events table, not in the prizes table.
Another crucial point is conforming to the First normal form, since that assures data atomicity. This is important because it will save you a lot of headache later on. By doing this correctly, you will also have the correct number of tables.
Last but not least: add relevant indexes to the columns that appear most often in queries. This will help a lot with performance. Don't worry about tables having too many rows; RDBMSes these days handle tables with hundreds of millions of rows, and they're designed to do that efficiently.
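For example, if matches are most often filtered by date (using the hypothetical match table sketched earlier):

-- index the column that appears most often in WHERE clauses
CREATE INDEX idx_match_played_on ON `match` (played_on);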
Besides compromising the quality and maintainability of your code (as others have pointed out), it's questionable whether you'd actually gain any performance either.
When you execute...
SELECT * FROM BRACKETS_XXX
...the DBMS needs to find the table whose name matches "BRACKETS_XXX", and that search is done in the DBMS's data dictionary, which is itself a bunch of tables. So you are replacing a search within your tables with a search within the data dictionary tables. You pay the price of the search either way.
(The dictionary tables may or may not be "real" tables, and may or may not have performance characteristics similar to real tables, but I bet those characteristics are unlikely to be better than "normal" tables for large numbers of rows. Also, the performance of the data dictionary is unlikely to be documented, and you really shouldn't rely on undocumented features.)
Also, the DBMS would suddenly need to prepare many more SQL statements (since they are now different statements, referring to separate tables), which would present the additional pressure on performance.
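To illustrate: with a single bracket table (as sketched earlier, names assumed), the server sees one statement shape and only the parameter varies, so it can be prepared once and reused:

PREPARE stmt FROM 'SELECT * FROM bracket WHERE tournament_id = ?';
SET @tid = 123;
EXECUTE stmt USING @tid;

With per-tournament tables, every tournament produces a textually different statement that must be parsed and prepared separately.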
The idea of creating new tables whenever a new instance of an item appears is really bad, sorry.
A (surely incomplete) list of why this is a bad idea:
Your code will need to automatically add tables whenever a new Division or whatever is created. This is definitely a bad practice and should be limited to extremely niche cases - which yours definitely isn't.
In case you decide to add to or revise a table's structure later (e.g. adding a new field), you will have to do it to hundreds of tables, which will be cumbersome, error-prone, and a big maintenance headache.
A RDBMS is built to scale in terms of rows, not tables and associated (indexes, triggers, constraints) elements - so you are working against your tool and not with it.
THIS ONE SHOULD BE THE REAL CLINCHER - how do you plan to handle requests like "list all matches which were played on a Sunday" or "find the most recent three brackets where Frank Perry was active"?
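With a single shared matches table, requests like those become ordinary queries. For instance the first one, assuming a played_on column and using the fact that MySQL's DAYOFWEEK() returns 1 for Sunday:

SELECT * FROM `match` WHERE DAYOFWEEK(played_on) = 1;

With per-tournament tables you would have to UNION hundreds of tables to answer the same question.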
You say:
I'm not a complete newb when it comes to database management; however, this is the first one I've done completely solo...
Can you remember another project where tables were cloned whenever a new set was required? If yes, didn't you notice some problems with that approach? If not, have you considered that this is precisely what a DBA would never ever do for any reason whatsoever?

Is this a good case for decomposing a table and having a 1-1 table relationship?

I am storing information about websites in a table. One set of information is the whois data about a websites domain name. This set of data contains about 40 fields and each record relates to a single website. I have no requirement to track updates. I could put all the whois data in the websites table, but it seems 'cleaner' and more intuitive to have the domain whois information in a new table with a 1-1 mapping.
What is the best solution in this case? Is a table with many fields always preferable over two smaller tables with an unnecessary join?
It would probably be easier to leave this as one table and use a view to "simplify" the data for the consumers.
One thing to consider is that your needs may change over time and you'll find you'll need to change how you split the table. If you just use a view, it's very simple to alter a view without having to figure out how to move the data from one table to the other.
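A sketch of that idea, with invented table and column names; consumers query the view while all the data stays in one physical table:

-- expose just the core website fields, hiding the ~40 whois fields
CREATE VIEW website_core AS
SELECT id, url, title
FROM websites;

Redefining the split later is then a CREATE OR REPLACE VIEW statement that touches no data.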
It depends on your application. What does your app do with website data? what does it do with the related whois data?
If you often access the website data, and seldom access the whois data (or the other way around) it would make sense to separate them. This is not so much a relational or logical way or reasoning, more a practical, performance-related reason. From a purely relational point of view, it would have to go in the same table.
If I think about it, I am having trouble coming up with a genuine real-world 1:1 example that would make sense in a purely relational model. This is not the case for a 1:0 example: subtypes are naturally modeled as a parent table having optional related rows in one or more child tables, in a 1:0 fashion.
A join is always costly. The only reason I would really consider splitting the two is if you will often query one set of columns, and very rarely the other.
If the performance hit of the join doesn't bother you, splitting up the data into two tables might make sense (no need to avoid duplicate column names, etc).
If the two sets of data have very different update/read frequencies, splitting can improve the cache hit ratio by moving the seldom-used fields into a separate table. But, as with all performance matters, this is very dependent on your workload, might change at a moment's notice, is not aligned with your relational model, and should be thoroughly benchmarked.
A join doesn't necessarily cost anything. Depending on how the tables are stored, the join could be a no-op. Note that such tables are not usually true 1-1, because a foreign key is always optional on one side of the constraint. So if the whois data does not apply to every row, that's a good reason to have two tables.
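One common way to express such an optional 1:0..1 relationship is to make the child table's primary key double as the foreign key (names assumed):

CREATE TABLE website_whois (
    website_id INT PRIMARY KEY,   -- same value as websites.id, so at most one whois row per site
    registrar VARCHAR(200),       -- ...plus the other ~40 whois fields
    FOREIGN KEY (website_id) REFERENCES websites(id)
);

Sites without whois data simply have no row here.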

MySQL DB design help

Pardon the elementary question but my newness to the realm of database design leaves me in a bind quite often.
I have a site that keeps growing with regard to families of information. In the beginning I had one sort of item I was describing, and all was well. That item occupied one record and had 34 columns of descriptive data attributed to it (a lot, now that I look back). As I get more and more into this stuff, I see that many developers break out data (when practical) into distinct tables.
I've now got additional tables that relate to the original item but are not always needed when describing the original item so I broke them out so they're not queried unnecessarily.
Anyhow, I have a new item I've been trying to organize which is a USER. The user table has typical columns like username, email, last_login, path to associated image, etc. These users have been making comments, which I keep in yet another table that contains columns with IDs that relate to the user and the item on which they are commenting.
Now... I am in the process of adding the obligatory user profile page to the site. Should I create yet another table containing only essential profile data, or add the profile data to the existing user record in the original user table? I am thinking housekeeping might be a pain if I add a "Remove me from site" function, as I would have to run something that kills the user record, the user profile record, and any other data associated with that user ID in other tables.
Basically what I am asking is: should I keep going with this "granular" design method, breaking everything out into essential parts, or does it ever serve me to consolidate into larger tables? I see a few instances where if a user deletes their account, I'll be left with a bunch of non-relevant data. For instance, the original items are restaurants... if I make a table to record "Visits" to restaurants, containing the Restaurant ID and the User ID, then if the user or restaurant gets removed from the site, this "Visits" table will have a bunch of useless records saying either "non-existent restaurant was visited by user 45" or "Restaurant 21 was visited by non-existent user".
I hope I make sense here... I'm just wondering if it's normal to end up with this "junk" data over time.
Thanks much,
Rob
Deleting that "non-relevant" data is a normal, healthy part of an application's life. It's just what happens. You just have to do it, like you brush your teeth or make your bed. Don't let two or three DELETE queries influence how your tables get structured. They're not that expensive, and honestly, if you think that's too much of a pain, you're in the wrong business :)
If you're using InnoDB tables, you can look into foreign key constraints that will take care of some of the cleanup for you.
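For example, for the "Visits" table from the question (column names assumed), ON DELETE CASCADE makes InnoDB remove the visit rows automatically when the user or restaurant is deleted:

CREATE TABLE visits (
    restaurant_id INT NOT NULL,
    user_id INT NOT NULL,
    visited_on DATETIME,
    FOREIGN KEY (restaurant_id) REFERENCES restaurants(id) ON DELETE CASCADE,
    FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE
) ENGINE=InnoDB;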
You'll be able to make these decisions much more easily if you learn about normalization.
In general, if data all relates to the same logical entity -- the same "thing" -- then it should go in the same table. Breaking one table into two just to keep the tables smaller is generally not a good idea. Depending on what you are doing, it may or may not make queries faster, and it introduces unnecessary complexity. Let me explain.
Whether it makes queries faster depends on the nature of the data and how you use it. If you have some very large field, like "rambling_comments varchar(5000)" or some such, and it is rarely used, then breaking it into a separate table so that what's left in the "main" table is relatively small could indeed make your queries faster, for the fairly obvious reason that there is now less data to read. But if the size of the fields you are thinking of breaking out are modest, and you often need data from both tables, then queries that only use one table don't gain that much, and queries that use both now need to do a join, which is usually more expensive than reading a somewhat bigger record.
But breaking up your tables will certainly make your programs more complex. Now you have to keep track of which data is in which table. You'll constantly be checking if that field is in the Item_Descriptive_Data table or the Item_Stock_Data table or whatever. You're liable to lose track at some point and accidentally put the same field into two tables. (Or worse, you'll decide this is a good idea and do it deliberately.) Then you have redundant and potentially contradictory data.
You have to do joins every time you need data that crosses tables. You create the possibility that records in one or more of the tables may not exist. Like, if you break your User table into User_Main and User_Profile, and you need data from both tables so you do a join, what happens if there is a record in User_Profile with no corresponding record in User_Main? You're going to have to add code to check for the possibility and deal with it. Oh, and blithely saying "That can never happen, no need to worry about it" is a very dangerous attitude: No matter that it's not SUPPOSED to happen, sooner or later it will, and if you don't handle the error gracefully, you could have a real mess.
In short, breaking up tables for performance reasons is usually a premature optimization. If you find that you have some real performance problem, THEN look at the tables and see if you should denormalize for efficiency. But don't start out trashing your database just to avoid a problem that might possibly happen someday.

Which is more efficient: Multiple MySQL tables or one large table?

I store various user details in my MySQL database. Originally it was set up in various tables, with data linked by UserIds and output via sometimes complicated calls to display and manipulate the data as required. Setting up a new system, it almost makes sense to combine all of these tables into one big table of related content.
Is this going to be a help or hindrance?
Speed considerations in calling, updating or searching/manipulating?
Here's an example of some of my table structure(s):
users - UserId, username, email, encrypted password, registration date, ip
user_details - cookie data, name, address, contact details, affiliation, demographic data
user_activity - contributions, last online, last viewing
user_settings - profile display settings
user_interests - advertising targetable variables
user_levels - access rights
user_stats - hits, tallies
Edit: I've upvoted all answers so far, they all have elements that essentially answer my question.
Most of the tables have a 1:1 relationship which was the main reason for denormalising them.
Are there going to be issues if the table spans across 100+ columns when a large portion of these cells are likely to remain empty?
Multiple tables help in the following ways / cases:
(a) if different people are going to be developing applications involving different tables, it makes sense to split them.
(b) If you want to give different kind of authorities to different people for different part of the data collection, it may be more convenient to split them. (Of course, you can look at defining views and giving authorization on them appropriately).
(c) For moving data to different places, especially during development, it may make sense to use tables resulting in smaller file sizes.
(d) Smaller foot print may give comfort while you develop applications on specific data collection of a single entity.
(e) It is a possibility: what you thought of as single-value data may turn out to really be multiple values in the future. E.g. credit limit is a single-value field as of now, but tomorrow you may decide to change it to (date from, date to, credit value). Split tables might come in handy then.
My vote would be for multiple tables - with data appropriately split.
Good luck.
Combining the tables is called denormalizing.
It may (or may not) help to make some queries (which make lots of JOINs) to run faster at the expense of creating a maintenance hell.
MySQL is capable of using only one JOIN method, namely NESTED LOOPS.
This means that for each record in the driving table, MySQL locates a matching record in the driven table in a loop.
Locating a record is quite a costly operation which may take dozens of times as long as pure record scanning.
Moving all your records into one table will help you to get rid of this operation, but the table itself grows larger, and the table scan takes longer.
If you have lots of records in the other tables, then the increase in table scan time can outweigh the benefit of the records being scanned sequentially.
Maintenance hell, on the other hand, is guaranteed.
Are all of them 1:1 relationships? I mean, if a user could belong to, say, different user levels, or if the users interests are represented as several records in the user interests table, then merging those tables would be out of the question immediately.
Regarding previous answers about normalization, it must be said that the database normalization rules completely disregard performance; they only look at what makes a neat database design. That is often what you want to achieve, but there are times when it makes sense to actively denormalize in pursuit of performance.
All in all, I'd say the question comes down to how many fields there are in the tables, and how often they are accessed. If user activity is often not very interesting, then it might just be a nuisance to always have it on the same record, for performance and maintenance reasons. If some data, like settings, say, are accessed very often, but simply contains too many fields, it might also not be convenient to merge the tables. If you're only interested in the performance gain, you might consider other approaches, such as keeping the settings separate, but saving them in a session variable of their own so that you don't have to query the database for them very often.
Do all of those tables have a 1-to-1 relationship? For example, will each user row only have one corresponding row in user_stats or user_levels? If so, it might make sense to combine them into one table. If the relationship is not 1 to 1 though, it probably wouldn't make sense to combine (denormalize) them.
Having them in separate tables vs. one table is probably going to have little effect on performance though unless you have hundreds of thousands or millions of user records. The only real gain you'll get is from simplifying your queries by combining them.
ETA:
If your concern is about having too many columns, then think about what stuff you typically use together and combine those, leaving the rest in a separate table (or several separate tables if needed).
If you look at the way you use the data, my guess is that you'll find that something like 80% of your queries use 20% of that data with the remaining 80% of the data being used only occasionally. Combine that frequently used 20% into one table, and leave the 80% that you don't often use in separate tables and you'll probably have a good compromise.
Creating one massive table goes against relational database principles. I wouldn't combine them all into one table. You're going to get multiple instances of repeated data. If your user has three interests, for example, you will have 3 rows with the same user data in them, just to store the three different interests. Definitely go for the multiple 'normalized' table approach. See this Wiki page for database normalization.
Edit:
I have updated my answer, as you have updated your question... I agree with my initial answer even more now since...
a large portion of these cells are likely to remain empty
If, for example, a user didn't have any interests, then if you normalize you simply won't have a row in the interests table for that user. If you have everything in one massive table, then you will have columns (and apparently a lot of them) that contain just NULLs.
I have worked for a telephony company where there were tons of tables, and getting data could require many joins. When read performance from these tables was critical, procedures were created that could generate a flat (i.e. denormalized) table, requiring no joins, calculations, etc., that reports could point to. These were then used in conjunction with a SQL Server agent to run the job at certain intervals (i.e. a weekly view of some stats would run once a week, and so on).
Why not use the same approach WordPress does: a users table with the basic user information that everyone has, plus a "user_meta" table that can hold basically any key/value pair associated with the user id. If you need all the meta information for a user, you can just add it to your query, and you won't have to run the extra query when it isn't needed, e.g. for logging in. This approach also leaves your users open to new features, such as storing their twitter handle or each individual interest. You also won't have to deal with a maze of associated IDs, because you have one table that rules all metadata, limiting it to only one association instead of 50.
WordPress does this specifically to allow features to be added via plugins, so your project becomes more scalable and won't require a complete database overhaul when you need to add a new feature.
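A rough sketch of that key/value layout (WordPress's real table differs in detail; names here are illustrative):

CREATE TABLE user_meta (
    meta_id INT AUTO_INCREMENT PRIMARY KEY,
    user_id INT NOT NULL,
    meta_key VARCHAR(255) NOT NULL,
    meta_value TEXT,
    FOREIGN KEY (user_id) REFERENCES users(id)
);

-- e.g. storing a twitter handle without any schema change
INSERT INTO user_meta (user_id, meta_key, meta_value)
VALUES (45, 'twitter_handle', '@example');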
I think this is one of those "it depends" situation. Having multiple tables is cleaner and probably theoretically better. But when you have to join 6-7 tables to get information about a single user, you might start to rethink that approach.
I would say it depends on what the other tables really mean.
Does user_details contain more than one row per user, and so on?
What level of normalization is best suited for your needs depends on your demands.
If you have one table with a good index, that would probably be faster, but on the other hand probably more difficult to maintain.
To me it looks like you could skip user_details, as it is probably a 1-to-1 relation with users.
But the rest probably hold a lot of rows per user?
Performance considerations on big tables
"Likes" and "views" (etc) are one of the very few valid cases for 1:1 relationship _for performance. This keeps the very frequent UPDATE ... +1 from interfering with other activity and vice versa.
Bottom line: separate frequent counters in very big and busy tables.
Another possible case is where you have a group of columns that are rarely present. Rather than having a bunch of nulls, have a separate table that is related 1:1, or more aptly phrased "1:rarely". Then use LEFT JOIN only when you need those columns. And use COALESCE() when you need to turn NULL into 0.
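A sketch of that pattern (table and column names invented for illustration):

-- the busy counters live in their own small table; LEFT JOIN tolerates
-- rows that have no counter row yet, and COALESCE turns the NULL into 0
SELECT p.id, p.title,
       COALESCE(c.views, 0) AS views
FROM posts p
LEFT JOIN post_counters c ON c.post_id = p.id;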
Bottom Line: It depends.
Limit search conditions to one table. An INDEX cannot reference columns in different tables, so a WHERE clause that filters on multiple columns might use an index on one table but then have to work harder to continue filtering on the columns in other tables. This issue is especially bad if "ranges" are involved.
Bottom line: Don't move such columns into a separate table.
TEXT and BLOB columns can be bulky, and this can cause performance issues, especially if you unnecessarily say SELECT *. Such columns are stored "off-record" (in InnoDB). This means that fetching them may involve one or more extra disk hits.
Bottom line: InnoDB is already taking care of this performance 'problem'.