I am designing a database and have run into something that I do not like. I will have up to 1000 different scenarios, and in each scenario I will need to store the state of each one of 64 different toggles. I've come up with the following tables:
Scenario (Scenario_ID, Scenario_Name)
Toggles (Scenario_ID, Toggle_1, Toggle_2, ..., Toggle_64)
This leaves me with a 65-column Toggles table. There has to be a better way than having 64 columns of "on" or "off", but I am at a loss as to what it might be. I don't want to store the toggle state in a CSV in one column because it will be constantly changing, and needs to be parsed quite often. It would be much easier for me to update the table by simply updating Toggle_14 to "off" rather than parsing a CSV, changing it, and reloading it.
What is a better database design?
Use a Many-to-Many relation:
Table Scenario (Id, Name)
Table ScenarioToggles (ScenarioId, TogglesId, ToggleState)
Table Toggles (Id, ToggleName)
The field, ToggleState in the ScenarioToggles table will hold the On/Off value.
It also yields some flexibility if you ever need more than 64 toggles.
You can design Toogles Table it like this:
Scenario_ID
Toogle_nr
Toogle_state
and get 64 entries for each scenario.
However, if toogles count is not going to change often and in every scenario you'll need all the toogles, your solution seems to be the best and the simplest.
It's a hack (sort of, ish, kinda), but depending on your lookup requirements, you could just store a single 64 character string, which you could access in your programming language of choice as an array.
However, the many, many, downsides of doing this more than likely outweigh the benefits. (Then again, storing 64 bits of flippable state information in a database is a pretty odd requirement in the first place.)
Since your question is tagged "mysql" I'd recommend using the SET type. Updating records to toggle values can be a little tricky though, as MySQL doesn't natively offer functions to that purpose (search for REMOVE_FROM_SET at MySQL's bug tracker for a workaround.) On the other hand, it's very compact and it won't require you to join 64 tables or use aggregate functions for every SELECT query.
Some people are just trying to be way too clever. Use one table, with scenario Id, name, and 64 toggle columns. This is well normalised, fast, space efficient, and extensible.
Related
Edit 1:
Because few good ppl have pointed out that my question isnt very clear, I thought I will rewrite it and make it more clear now.
So basically, I am making an app, which allows users to create his own form with his own set of input fields, with data like name, type etc. After creating his form and he publishes the form, whenever there is an entry in the form, the data gets saved into the db ofcourse. Because the form itself is dynamic. I need a way to save this data.
My first choice was JSONizing it and saving. But because I cannot do any SQL queries on them, if I save in JSON format, i am eliminating this option.
Then the simple method is storing in a table like (id, rowid, columnname, value) and i keep the rowid same for all row data. But in this way, if a form contains 30 fields, after 100 entries my db would have 3000 rows. so in the long run, it would go huge and I think queries will get slow when there are millions of rows in the table.
Then I got this idea of a table like (id, rowid, column1, column2...column100). And i will save all the inputs in the form into single row. In this way it does add only 1 row per submit and its easier to query too. I will store the actual column names and map them to the right column(number) from there. This is my idea. column100 because 100 is the maximum inputs the user can add in his form.
So my question is, whether my idea is good, or should I stick to the classic table.
If I've understood your question, you need to have to design a database structure to store data whose schema you don't know in advance.
This is hard - there's no "efficient" solution in relational databases that I'm aware of.
Option 1 would be to look at a non-relational (NoSQL) solution instead.
I won't elaborate the benefits and drawbacks, as they are highly dependent on which NoSQL option you choose.
It's worth noting that many relational engines (including MySQL) allow you to store and query structured data formats like JSON. I've not used this feature in MySQL myself, but similar functionality in SQL Server performs very well.
Within relational databases, the common solution is an "Entity/Attribute/Value" (EAV)schema. This is sorta like your option 2.
EAV designs can theoretically store an unlimited number of columns, and an unlimited number of rows - but common queries quickly become impossible. In your sample data, finding all records where the name begins with a K and the power is at least 22 turns into a very complex SQL query. It also means the application needs to enforce rules of uniqueness, mandatory/optional data attributes, and data transformation from one format to another.
From a performance point of view, this doesn't really scale to complex queries. This is because every clause in your "where" needs a self join, and indexes won't have a big impact on searches for non-text strings (searching for numerical "greater than 20" is not the same as searching for a text "greater than 20".).
Option 3 is, indeed, to make the schema logic fit into a limited number of columns (your option 1).
It means you have a limitation on the number of columns, and you still have to manage mandatory/optional, uniqueness etc. in the application. However, querying the data should be easier - finding accounts where the name starts with K and the power is at least 22 is a fairly straightforward exercise.
You do have a lot of unused columns, but that doesn't really impact performance much - disk space is so cheap that all the wasted space is probably less space than you carry around in your smart phone.
If I understand your requirement, what I will do with your requirement is to create a many to many relationship something like this:
(tbl1) form:
- id
- field1
- field2
(tbl2) user_added_fields:
- id
- field_name
(tbl3) form_table_user_added_fields:
- form_id (fk)
- user_added_fields_id (fk)
This may not likely to solve your own requirements, but I hope this will give you a hint. Happy coding! :)
I collect a list of Facebook friends from my users including First, Last, Gender and DOB. I am then attempting to compare that database of names (stored as a table in MySQL) to another database comprised of similar information.
What would be the best way to conceptually link these results, with the second database being the much larger set of records (>500k rows)?
Here was what I was proposing:
Iterate through Facebook names
Search Last + DOB - if they match, assume a "confident" match
Search Last + First - if they match, assume a "probable" match
Search Last + Lichtenstein(First) above a certain level, assume a "possible" match
Are there distributed computing concepts that I am missing that may make this faster than a sequential mySQL approach? What other pitfalls may spring up, noting that it is much more important to not have a false-positive rather than miss a record?
Yes, your idea seems like a better algorithm.
Assuming performance is your concern, you can use caching to store the values that are just being searched. You can also start indexing the results in a NoSQL database such that the results will be very faster, so that you will have better read performance. If you have to use MySQL, read about polyglot persistence.
Assuming simplicity is your concern, you can still use indexing in a NoSQL database, so over the time you don't have to do myriad of joins will spoil the experience of the user and the developer.
There could be much more concerns, but it all depends on where would you like to use it, to use in a website, or to such data analytic purpose.
If you want to operate on the entire set of data (as opposed to some interactive thing), this data set size might be small enough to simply slurp into memory and go from there. Use a List to hang on to the data then create a Map> that for each unique last name points (via integer index) to all the places in the list where it exists. You'll also set yourself up to be able to perform more complex matching logic without getting caught up trying to coerce SQL into doing it. Especially since you are spanning two different physical databases...
I have 40+ columns in my table and i have to add few more fields like, current city, hometown, school, work, uni, collage..
These user data wil be pulled for many matching users who are mutual friends (joining friend table with other user friend to see mutual friends) and who are not blocked and also who is not already friend with the user.
The above request is little complex, so i thought it would be good idea to put extra data in same user table to fast access, rather then adding more joins to the table, it will slow the query more down. but i wanted to get your suggestion on this
my friend told me to add the extra fields, which wont be searched on one field as serialized data.
ERD Diagram:
My current table: http://i.stack.imgur.com/KMwxb.png
If i join into more tables: http://i.stack.imgur.com/xhAxE.png
Some Suggestions
nothing wrong with this table and columns
follow this approach MySQL: Optimize table with lots of columns - which serialize extra fields into one field, which are not searchable's
create another table and put most of the data there. (this gets harder on joins, if i already have 3 or more tables to join to pull the records for users (ex. friends, user, check mutual friends)
As usual - it depends.
Firstly, there is a maximum number of columns MySQL can support, and you don't really want to get there.
Secondly, there is a performance impact when inserting or updating if you have lots of columns with an index (though I'm not sure if this matters on modern hardware).
Thirdly, large tables are often a dumping ground for all data that seems related to the core entity; this rapidly makes the design unclear. For instance, the design you present shows 3 different "status" type fields (status, is_admin, and fb_account_verified) - I suspect there's some business logic that should link those together (an admin must be a verified user, for instance), but your design doesn't support that.
This may or may not be a problem - it's more a conceptual, architecture/design question than a performance/will it work thing. However, in such cases, you may consider creating tables to reflect the related information about the account, even if it doesn't have a x-to-many relationship. So, you might create "user_profile", "user_credentials", "user_fb", "user_activity", all linked by user_id.
This makes it neater, and if you have to add more facebook-related fields, they won't dangle at the end of the table. It won't make your database faster or more scalable, though. The cost of the joins is likely to be negligible.
Whatever you do, option 2 - serializing "rarely used fields" into a single text field - is a terrible idea. You can't validate the data (so dates could be invalid, numbers might be text, not-nulls might be missing), and any use in a "where" clause becomes very slow.
A popular alternative is "Entity/Attribute/Value" or "Key/Value" stores. This solution has some benefits - you can store your data in a relational database even if your schema changes or is unknown at design time. However, they also have drawbacks: it's hard to validate the data at the database level (data type and nullability), it's hard to make meaningful links to other tables using foreign key relationships, and querying the data can become very complicated - imagine finding all records where the status is 1 and the facebook_id is null and the registration date is greater than yesterday.
Given that you appear to know the schema of your data, I'd say "key/value" is not a good choice.
I would advice to run some tests. Try it both ways and benchmark it. Nobody will be able to give you a definitive answer because you have not shared your hardware configuration, sample data, sample queries, how you plan on using the data etc. Here is some information that you may want to consider.
Use The Database as it was intended
A relational database is designed specifically to handle data. Use it as such. When written correctly, joining data in a well written schema will perform well. You can use EXPLAIN to optimize queries. You can log SLOW queries and improve their performance. Databases have been around for years, if putting everything into a single table improved performance, don't you think that would be all the buzz on the internet and everyone would be doing it?
Engine Types
How will inserts be affected as the row count grows? Are you using MyISAM or InnoDB? You will most likely want to use InnoDB so you get row level locking and not table. Make sure you are using the correct Engine type for your tables. Get the information you need to understand the pros and cons of both. The wrong engine type can kill performance.
Enhancing Performance using Partitions
Find ways to enhance performance. For example, as your datasets grow you could partition the data. Data partitioning will improve the performance of a large dataset by keeping slices of the data in separate partions allowing you to run queries on parts of large datasets instead of all of the information.
Use correct column types
Consider using UUID Primary Keys for portability and future growth. If you use proper column types, it will improve performance of your data.
Do not serialize data
Using serialized data is the worse way to go. When you use serialized fields, you are basically using the database as a file management system. It will save and retrieve the "file", but then your code will be responsible for unserializing, searching, sorting, etc. I just spent a year trying to unravel a mess like that. It's not what a database was intended to be used for. Anyone advising you to do that is not only giving you bad advice, they do not know what they are doing. There are very few circumstances where you would use serialized data in a database.
Conclusion
In the end, you have to make the final decision. Just make sure you are well informed and educated on the pros and cons of how you store data. The last piece of advice I would give is to find out what heavy users of mysql are doing. Do you think they store data in a single table? Or do they build a relational model and use it the way it was designed to be used?
When you say "I am going to put everything into a single table", you are saying that you know more about performance and can make better choices for optimization in your code than the team of developers that constantly work on MySQL to make it what it is today. Consider weighing your knowledge against the cumulative knowledge of the MySQL team and the DBAs, companies, and members of the database community who use it every day.
At a certain point you should look at the "short row model", also know as entity-key-value stores,as well as the traditional "long row model".
If you look at the schema used by WordPress you will see that there is a table wp_posts with 23 columns and a related table wp_post_meta with 4 columns (meta_id, post_id, meta_key, meta_value). The meta table is a "short row model" table that allows WordPress to have an infinite collection of attributes for a post.
Neither the "long row model" or the "short row model" is the best model, often the best choice is a combination of the two. As #nevillek pointed out searching and validating "short row" is not easy, fetching data can involve pivoting which is annoyingly difficult in MySql and Oracle.
The "long row model" is easier to validate, relate and fetch, but it can be very inflexible and inefficient when the data is sparse. Some rows may have only a few of the values non-null. Also you can't add new columns without modifying the schema, which could force a system outage, depending on your architecture.
I recently worked on a financial services system that had over 700 possible facts for each instrument, most had less than 20 facts. This could have been built by setting up dozens of tables, each for a particular asset class, or as a table with 700 columns, but we chose to use a combination of a table with about 20 columns containing the most popular facts and a 4 column table which contained the other facts. This design was efficient but was difficult ot access, so we built a few table functions in PL/SQL to assist with this.
I have a general comment for you,
Think about it: If you put anything more than 10-12 columns in a table even if it makes sense to put them in a table, I guess you are going to pay the price in the short term, long term and medium term.
Your 3 tables approach seems to be better than the 1 table approach, but consider making those into 5-6 tables rather than 3 tables because you still can.
Move currently, currently_position, currently_link from user-table and work from user-profile into a new table with your primary key called USERWORKPROFILE.
Move locale Information from user-profile to a newer USERPROFILELOCALE information because it is generic in nature.
And yes, all your generic attributes in all the tables should be int and not varchar.
For instance, City needs to move out to a new table called LIST_OF_CITIES with cityid.
And your attribute city should change from varchar to int and point to cityid in LIST_OF_CITIES.
Do not worry about performance issues; the more tables you have, better the performance, because you are actually handing out the performance to the database provider instead of taking it all in your own hands.
I'm planning to develop a PHP Web App, it will mainly be used by registered users(sessions)
While thinking about the DB design, I was contemplating that in order to give the best user experience possible there would be lots of options for the user to activate, deactivate, specify, etc.
For example:
- Options for each layout elements, dialog boxes, dashboard, grid, etc.
- color, size, stay visible, invisible, don't ask again, show everytime, advanced mode, simple mode, etc.
This would get like 100s of fields ranging from simple Yes/No or 1 to N values..., for each user.
So, is it having a field for each of these options the way to go?
or how do those CRMs or CMS or other Web Apps do it to store lots of 1-2 char long values?
Do they group them on Text fields separated by a special char and then "explode" them as an array for runtime usage?
thank you
How about something like this:
CREATE TABLE settings (
user_id INT,
setting_name VARCHAR(255),
setting_value CHAR(2)
)
That way, to store a configuration setting for a user, you can do:
INSERT INTO settings (user_id, setting_name, setting_value),
VALUES (1, "timezone", "+8")
And when you need to query a setting for a particular user, you can do:
SELECT setting_value FROM settings
WHERE user_id = 1 AND setting_name = "timezone"
I would absolutely be inclined to have individual fields for each option. My rule of thumb is that each column holds exactly one piece of data whenever possible. No more, no less. As was mentioned earlier, the ease of maintenance and the ability to add / drop options down the road far outweighs the pain in the arse of setting it up. I would, however, put some thought into how you create the table(s). The idea mentioned earlier was to have a Settings table with 100 columns ( one for each option ) and one row for each user. That would work, to be sure. If it were me I would be inclined to break it down a bit further. You start with a basic User table, of course. That would hold the basics of username, password, userid etc. That way you can use the numeric userid as the key index for your Settings table(s). But after that I would try to break down the settings into smaller tables based on logical usage. For example, if you have 100 options, and 19 of those pertain to how a user views / is viewed / behaves in one specific part of the site, say something like a forum, then break those out into a separate table ie ForumSettings. Maybe there are 12 more that pertain to email preferences, but would not be used in other areas of the site / app. Now you have an EmailSettings table. Doing this would not only reduce the number of columns in your generic Settings table, but it would also make writing queries for specific tasks or areas of the app much easier, speed up the performance a tick, and make maintenance moving forward far less painful. Some may disagree as from a strictly data modeling perspective I'm pretty sure that the one Settings table would be indicated. But from a real world perspective, I have never gone wrong using logical chunks such as this.
From a pure data-model perspective, that would be the clearest design (though awful wide). Some might try to bitmask them into a single field for assumed space reasons, but the logic to encode/decode makes that not worthwhile, in my opinion. Also you lose the ability to index on them.
Another option (I just saw posted) is to hold a separate table with an FK back to the user table. But then you have to iterate over the results to get the value you want to check for.
I've taken over development on a project that has a user table with over 30 columns. And the bad thing is that changes and additions to the columns keep happening.
This isn't right.
Should I push to have the extra fields moved into a second table as values and create a third table that stores those column names?
user
id
email
user_field
id
name
user_value
id
user_field_id
user_id
value
Do not go the key / value route. SQL isn't designed to handle it and it'll make getting actual data out of your database an exercise in self torture. (Examples: Indexes don't work well. Joins are lots of fun when you have to join just to get the data you're joining on. It goes on.)
As long as the data is normalized to a decent level you don't have too many columns.
EDIT: To be clear, there are some problems that can only be solved with the key / value route. "Too many columns" isn't one of them.
It's hard to say how many is too many. It's really very subjective. I think the question you should be asking is not, "Are there too many columns?", but, rather, "Do these columns belong here?" What I mean by that is if there are columns in your User table that aren't necessarily properties of the user, then they may not belong. For example, if you've got a bunch of columns that sum up the user's address, then maybe you pull those out into an Address table with an FK into User.
I would avoid using key/value tables if possible. It may seem like an easy way to make things extensible, but it's really just a pain in the long run. If you find that your schema is changing very consistently you may want to consider putting some kind of change control in place to vet changes to only those that are necessary, or move to another technology that better supports schema-less storage like NoSQL with MongoDB or CouchDB.
This is often known as EAV, and whether this is right for your database depends on a lot of factors:
http://en.wikipedia.org/wiki/Entity-attribute-value_model
http://karwin.blogspot.com/2009/05/eav-fail.html
http://www.slideshare.net/billkarwin/sql-antipatterns-strike-back
Too many columns is not really one of them.
Changes and additions to a table are not a bad thing if it means they accurately reflect changes in your business requirements.
If the changes and additons are continual then perhaps you need to sit down and do a better job of defining the requirements. Now I can't say if 30 columns is toomany becasue it depends on how wide they are and whether thay are something that shouldbe moved to a related table. For instnce if you have fields like phone1, phone2, phone 3, youo have a mess that needs to be split out into a related table for user_phone. Or if all your columns are wide (and your overall table width is wider than the pages the databases stores data in) and some are not that frequently needed for your queries, they might be better in a related table that has a one-to-one relationship. I would probably not do this unless you have an actual performance problem though.
However, of all the possible choices, the EAV model you described is the worst one both from a maintainabilty and performance viewpoint. It is very hard to write decent queries against this model.
This really depends on what you're trying to do.