Efficient MySQL Database Structure for Dynamic Form Creation - mysql

I'm creating an application in Codeigniter which will allow anyone, without signing in, to create a form to be filled out using different input types (using text boxes, dropdowns, checkboxes, etc.). This form could be 1-100 questions and when completed it will be emailed to someone else who will then fill it out on the site.
I first set up my MySQL database similar to this post, with quite a few different tables all with only a few columns. I then indexed and used foreign keys to link the information.
Since then, I have changed and set up my database like this so I'm making fewer queries:
Document
id, name, email, recipientname, recipientemail, document name
Document Questions
document_id, question_id, question, type, comments
Is having more tables with fewer columns but more queries more efficient than how I'm doing it now? I understand that normalization plays a role, but to what extent are you hindering performance by making your tables so specifically small?

From a normalization point of view there are things you could do to further normalize your data (recipients could have their own entity and types could also), but it's not always the most optimal way of accessing your data.
For example, if you split your problem into 4 different entities (Types could just as easily be an ENUM):
Documents
Document Questions
Recipients
Types
Then to fetch a single form for your application you would be executing a query with multiple joins. If you're using MyISAM then all four of your tables become locked until the query finishes. Queries with bad joins and bad indexes can become very slow.
A better alternative would be to execute four separate queries on the database (adding indexes suited to the most common queries you're running) to retrieve your data; this way tables stay locked for a shorter period of time.
I know this is an extreme example, but I would concentrate more on your index optimization and strike a good balance between normalization and performance.
To sum up, sometimes fully normalized data means lower performance.
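A minimal sketch of the "separate, indexed queries" idea, using the asker's two-table layout (so two queries rather than four). SQLite from the Python standard library stands in for MySQL so the example runs anywhere; the table names, sample rows, and the fetch_form helper are assumptions made for illustration.

```python
import sqlite3

# SQLite is used only so the sketch is self-contained; the same DDL idea applies to MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE document (
    id             INTEGER PRIMARY KEY,
    name           TEXT,
    email          TEXT,
    recipientname  TEXT,
    recipientemail TEXT,
    document_name  TEXT
);
CREATE TABLE document_question (
    document_id INTEGER NOT NULL REFERENCES document(id),
    question_id INTEGER NOT NULL,
    question    TEXT,
    type        TEXT,
    comments    TEXT,
    -- The composite primary key doubles as the index for "all questions of one document".
    PRIMARY KEY (document_id, question_id)
);
""")
conn.execute(
    "INSERT INTO document VALUES (1, 'Ann', 'ann@example.com', 'Bob', 'bob@example.com', 'Survey')"
)
conn.executemany(
    "INSERT INTO document_question VALUES (?, ?, ?, ?, ?)",
    [(1, 1, "Your name?", "text", ""), (1, 2, "Favourite colour?", "dropdown", "")],
)

def fetch_form(conn, document_id):
    """Load one form with two short indexed lookups instead of one multi-join query."""
    doc = conn.execute(
        "SELECT id, document_name, recipientemail FROM document WHERE id = ?",
        (document_id,),
    ).fetchone()
    questions = conn.execute(
        "SELECT question_id, question, type FROM document_question "
        "WHERE document_id = ? ORDER BY question_id",
        (document_id,),
    ).fetchall()
    return doc, questions

doc, questions = fetch_form(conn, 1)
```

Both lookups hit a primary key, so each one touches only a handful of rows no matter how many forms exist.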

Related

What is more efficient, a table with 100 columns and less rows or 5 columns and 30 times more rows

Edit 1:
Because a few good people have pointed out that my question isn't very clear, I thought I would rewrite it to make it clearer.
So basically, I am making an app which allows a user to create their own form with their own set of input fields, each with data like name, type, etc. After the user creates and publishes the form, every entry submitted through it gets saved into the DB, of course. Because the form itself is dynamic, I need a way to save this data.
My first choice was JSONizing it and saving that. But because I cannot run any SQL queries against the data if I save it in JSON format, I am eliminating this option.
Then the simple method is storing it in a table like (id, rowid, columnname, value), keeping the rowid the same for all values of one entry. But this way, if a form contains 30 fields, after 100 entries my DB would have 3,000 rows. So in the long run it would grow huge, and I think queries will get slow when there are millions of rows in the table.
Then I got the idea of a table like (id, rowid, column1, column2...column100), where I save all the inputs of one form submission into a single row. This way it adds only 1 row per submit and it's easier to query too. I will store the actual column names separately and map them to the right column (number) from there. That is my idea. It is column100 because 100 is the maximum number of inputs the user can add to a form.
So my question is whether my idea is good, or whether I should stick to the classic table.
If I've understood your question, you need to design a database structure to store data whose schema you don't know in advance.
This is hard - there's no "efficient" solution in relational databases that I'm aware of.
Option 1 would be to look at a non-relational (NoSQL) solution instead.
I won't elaborate the benefits and drawbacks, as they are highly dependent on which NoSQL option you choose.
It's worth noting that many relational engines (including MySQL) allow you to store and query structured data formats like JSON. I've not used this feature in MySQL myself, but similar functionality in SQL Server performs very well.
Within relational databases, the common solution is an "Entity/Attribute/Value" (EAV) schema. This is sort of like your option 2.
EAV designs can theoretically store an unlimited number of columns, and an unlimited number of rows - but common queries quickly become impossible. In your sample data, finding all records where the name begins with a K and the power is at least 22 turns into a very complex SQL query. It also means the application needs to enforce rules of uniqueness, mandatory/optional data attributes, and data transformation from one format to another.
From a performance point of view, this doesn't really scale to complex queries. This is because every clause in your "where" needs a self join, and indexes won't have a big impact on searches for non-text values, since every value is stored as text (searching for a numerical "greater than 20" is not the same as searching for a text "greater than 20").
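To make the self-join point concrete, here is a small sketch of the "name begins with K and power is at least 22" query against an EAV table. SQLite (Python standard library) stands in for MySQL; the entry_value table, its entry_id column, and the sample rows are assumptions for illustration, not the asker's actual data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per (entry, attribute); every value is stored as text.
conn.execute("CREATE TABLE entry_value (entry_id INTEGER, columnname TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO entry_value VALUES (?, ?, ?)",
    [
        (1, "name", "Kara"), (1, "power", "25"),
        (2, "name", "Ken"),  (2, "power", "9"),
        (3, "name", "Liam"), (3, "power", "30"),
    ],
)

# Each condition on a different attribute needs its own copy of the table.
correct = [r[0] for r in conn.execute("""
    SELECT n.entry_id
    FROM entry_value AS n
    JOIN entry_value AS p ON p.entry_id = n.entry_id
    WHERE n.columnname = 'name'  AND n.value LIKE 'K%'
      AND p.columnname = 'power' AND CAST(p.value AS INTEGER) >= 22
    ORDER BY n.entry_id
""")]

# Without the cast the comparison is textual, and '9' sorts after '22'.
naive = [r[0] for r in conn.execute("""
    SELECT n.entry_id
    FROM entry_value AS n
    JOIN entry_value AS p ON p.entry_id = n.entry_id
    WHERE n.columnname = 'name'  AND n.value LIKE 'K%'
      AND p.columnname = 'power' AND p.value >= '22'
    ORDER BY n.entry_id
""")]
```

The cast fixes the result, but it also means a plain index on the value column cannot be used for the numeric range, which is the scaling problem described above.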
Option 3 is, indeed, to make the schema logic fit into a limited number of columns (your option 1).
It means you have a limitation on the number of columns, and you still have to manage mandatory/optional, uniqueness etc. in the application. However, querying the data should be easier - finding accounts where the name starts with K and the power is at least 22 is a fairly straightforward exercise.
You do have a lot of unused columns, but that doesn't really impact performance much - disk space is so cheap that all the wasted space is probably less space than you carry around in your smart phone.
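A rough sketch of the fixed-column approach with a name-to-slot mapping table, again using SQLite from the Python standard library in place of MySQL. MAX_FIELDS is reduced to 5 to keep it short (the question uses 100), and form_entry, form_field, slot_column and save_entry are hypothetical names.

```python
import sqlite3

MAX_FIELDS = 5  # the question allows 100; kept small for the sketch

conn = sqlite3.connect(":memory:")
generic_cols = ", ".join(f"column{i} TEXT" for i in range(1, MAX_FIELDS + 1))
conn.execute(
    f"CREATE TABLE form_entry (id INTEGER PRIMARY KEY, form_id INTEGER, {generic_cols})"
)
# Maps each user-defined field name to one of the generic columns.
conn.execute(
    "CREATE TABLE form_field (form_id INTEGER, field_name TEXT, slot INTEGER, "
    "PRIMARY KEY (form_id, field_name))"
)
conn.executemany(
    "INSERT INTO form_field VALUES (?, ?, ?)", [(1, "name", 1), (1, "power", 2)]
)

def slot_column(conn, form_id, field_name):
    """Translate a field name into its physical column, e.g. 'power' -> 'column2'."""
    row = conn.execute(
        "SELECT slot FROM form_field WHERE form_id = ? AND field_name = ?",
        (form_id, field_name),
    ).fetchone()
    if row is None:
        raise KeyError(field_name)
    return f"column{int(row[0])}"

def save_entry(conn, form_id, values):
    """Store one submission as a single row; unused columns stay NULL."""
    columns = [slot_column(conn, form_id, name) for name in values]
    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(
        f"INSERT INTO form_entry (form_id, {column_list}) VALUES (?, {placeholders})",
        (form_id, *values.values()),
    )

for submission in (
    {"name": "Kara", "power": "25"},
    {"name": "Ken", "power": "9"},
    {"name": "Liam", "power": "30"},
):
    save_entry(conn, 1, submission)

name_col = slot_column(conn, 1, "name")
power_col = slot_column(conn, 1, "power")
# No self join: both conditions are plain column predicates on the same row.
matches = [r[0] for r in conn.execute(
    f"SELECT id FROM form_entry WHERE form_id = ? AND {name_col} LIKE 'K%' "
    f"AND CAST({power_col} AS INTEGER) >= 22 ORDER BY id",
    (1,),
)]
row_count = conn.execute("SELECT COUNT(*) FROM form_entry").fetchone()[0]
```

The column names are interpolated into the SQL, which is only safe here because they are built from an integer read from the mapping table, never from user input.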
If I understand your requirement, what I would do is create a many-to-many relationship, something like this:
(tbl1) form:
- id
- field1
- field2
(tbl2) user_added_fields:
- id
- field_name
(tbl3) form_table_user_added_fields:
- form_id (fk)
- user_added_fields_id (fk)
This may not solve your exact requirements, but I hope it gives you a hint. Happy coding! :)

Having data stored across tables representing individual data types - Why is it wrong?

Say I have lots of time to waste and decide to make a database where information is not stored as entities but in separate inter-related tables representing INT,VARCHAR,DATE,TEXT, etc types.
It would be such a revolution to never have to design a database structure ever again except that the fact no-one else has done it probably indicates it's not a good idea :p
So why is this a bad design ? What principles is this going against ? What issues could it cause from a practical point of view with a relational database ?
P.S: This is for the learning exercise.
Why shouldn't you separate out the fields from your tables based on their data types? Well, there are two reasons, one philosophical, and one practical.
Philosophically, you're breaking normalization
A properly normalized database will have different tables for different THINGS, with each table having all fields necessary and unique for that specific "thing." If the only way to find the make, model, color, mileage, manufacture date, and purchase date of a given car in my CarCollectionDatabase is to join meaningless keys on three tables demarcated by data type, then my database has almost zero discoverability and no real cohesion.
If you designed a database like that, you'd find writing queries and debugging statements would be obnoxiously tiresome. Which is kind of the reason you'd use a relational database in the first place.
(And, really, that will make writing queries WAY harder.)
Practically, databases don't work that way.
Every database engine or data-storage mechanism I've ever seen is simply not meant to be used with that level of abstraction. Whatever engine you had, I don't know how you'd get around essentially doubling your data design with fields. And with a five-fold increase in row count, you'd have a massive increase in index size, to the point that once you get a few million rows your indexes wouldn't actually help.
If you tried to design a database like that, you'd find that even if you didn't mind the headache, you'd wind up with slower performance. Instead of 1,000,000 rows with 20 fields, you'd have that one table with just as many fields, and some 5-6 extra tables with 1,000,000+ entries each. And even if you optimized that away, your indexes would be larger, and larger indexes run slower.
Of course, those two ONLY apply if you're actually talking about databases. There's no reason, for example, that an application can't serialize to a text file of some sort (JSON, XML, etc.) and never write to a database.
And just because your application needs to store SQL data doesn't mean that you need to store everything, or can't use homogeneous and generic tables. An Access-like application that lets users define their own "tables" might very well keep each field on a distinct row... although in that case your database's THINGS would be those tables and their fields. (And it wouldn't run as fast as a natively written database.)

MySQL Database Structure

I will have a table with a few million entries and I have been wondering if it was smarter to create more than just this one table, even though they would all have the same structure? Would it save resources and would it be more efficient in the end?
This is my particular concern because I plan on creating a small search engine which indexes about 3,000,000 sites, and each site will have approximately 30 words that are being indexed. This is my structure right now:
site
--id
--url
word
--id
--word
appearances
--site_id
--word_id
--score
Should I keep this structure? Or should I create tables for A words, B words, C words etc? Same with the appearances table
Select queries are faster on smaller tables. You want to fit the indexes you have to sort on into your system's memory for better performance.
More importantly, tables should not be defined in order to hold a certain type of data, but instead a collection of associated data. So if the data you are storing has logical differences, maybe it should be broken into separate tables.
(Incomplete)
Pros:
Faster data access
Easier to copy or back up
Cons:
Cannot easily compare data from different tables.
Union and join queries are needed to compare across tables
If you aren't concerned with some latency on your database, it should be able to handle this on the order of a few million records without too much trouble.
Here's some questions to ask yourself:
Are the records all inter-related? Is there any way of cleanly dividing them into different, non-overlapping groups? Are these groups well defined, or subject to change?
Is maintaining optimal write speed more of a concern than simplicity of access to data?
Is there any way of partitioning the records into different categories?
Is replication a concern? Redundancy?
Are you concerned about transaction safety?
Is it possible to re-structure the data later if you get the initial schema wrong?
There are a lot of ways of tackling this problem, but until you know the parameters you're working with, it's very hard to say.
Usually step one is to collect either a large corpus of genuine data, or at least simulate enough data that's reasonably similar to the genuine data to be structurally the same. Then you use your test data to try out different methods of storing and retrieving it.
Without any test data you're just stabbing in the dark.
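For reference, the single-table structure from the question can be sketched like this, with one appearances table indexed by word rather than separate tables per letter. SQLite (Python standard library) stands in for MySQL; the sample URLs, scores, and the search helper are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE site (id INTEGER PRIMARY KEY, url  TEXT NOT NULL UNIQUE);
CREATE TABLE word (id INTEGER PRIMARY KEY, word TEXT NOT NULL UNIQUE);
CREATE TABLE appearances (
    site_id INTEGER NOT NULL REFERENCES site(id),
    word_id INTEGER NOT NULL REFERENCES word(id),
    score   REAL    NOT NULL,
    -- word_id first, so "all sites for this word" is a single index range scan.
    PRIMARY KEY (word_id, site_id)
);
""")
conn.executemany(
    "INSERT INTO site VALUES (?, ?)", [(1, "http://a.example"), (2, "http://b.example")]
)
conn.executemany("INSERT INTO word VALUES (?, ?)", [(1, "mysql"), (2, "index")])
conn.executemany(
    "INSERT INTO appearances VALUES (?, ?, ?)",
    [(1, 1, 0.9), (2, 1, 0.4), (1, 2, 0.5)],
)

def search(conn, term):
    """Return URLs containing the term, best score first."""
    return [r[0] for r in conn.execute("""
        SELECT s.url
        FROM word AS w
        JOIN appearances AS a ON a.word_id = w.id
        JOIN site AS s        ON s.id = a.site_id
        WHERE w.word = ?
        ORDER BY a.score DESC
    """, (term,))]

mysql_hits = search(conn, "mysql")
index_hits = search(conn, "index")
no_hits = search(conn, "postgres")
```

A lookup only ever reads the index entries for one word, so splitting into "A words", "B words" tables would mostly duplicate what the index already does. Whether that holds at 90 million appearance rows is exactly what the test data mentioned above should confirm.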

mysql table with 40+ columns

I have 40+ columns in my table and I have to add a few more fields like current city, hometown, school, work, uni, college...
This user data will be pulled for many matching users who are mutual friends (joining the friend table with the other user's friends to see mutual friends), who are not blocked, and who are not already friends with the user.
The above request is a little complex, so I thought it would be a good idea to put the extra data in the same user table for fast access, rather than adding more joins, which would slow the query down further. But I wanted to get your suggestions on this.
My friend told me to store the extra fields that won't be searched on in a single field as serialized data.
ERD Diagram:
My current table: http://i.stack.imgur.com/KMwxb.png
If i join into more tables: http://i.stack.imgur.com/xhAxE.png
Some Suggestions
nothing wrong with this table and columns
follow this approach, MySQL: Optimize table with lots of columns - which serializes the extra fields that are not searchable into one field
create another table and put most of the data there (this gets harder on joins if I already have 3 or more tables to join to pull the records for users, e.g. friends, user, check mutual friends)
As usual - it depends.
Firstly, there is a maximum number of columns MySQL can support, and you don't really want to get there.
Secondly, there is a performance impact when inserting or updating if you have lots of columns with an index (though I'm not sure if this matters on modern hardware).
Thirdly, large tables are often a dumping ground for all data that seems related to the core entity; this rapidly makes the design unclear. For instance, the design you present shows 3 different "status" type fields (status, is_admin, and fb_account_verified) - I suspect there's some business logic that should link those together (an admin must be a verified user, for instance), but your design doesn't support that.
This may or may not be a problem - it's more a conceptual, architecture/design question than a performance/will it work thing. However, in such cases, you may consider creating tables to reflect the related information about the account, even if it doesn't have a x-to-many relationship. So, you might create "user_profile", "user_credentials", "user_fb", "user_activity", all linked by user_id.
This makes it neater, and if you have to add more facebook-related fields, they won't dangle at the end of the table. It won't make your database faster or more scalable, though. The cost of the joins is likely to be negligible.
Whatever you do, option 2 - serializing "rarely used fields" into a single text field - is a terrible idea. You can't validate the data (so dates could be invalid, numbers might be text, not-nulls might be missing), and any use in a "where" clause becomes very slow.
A popular alternative is "Entity/Attribute/Value" or "Key/Value" stores. This solution has some benefits - you can store your data in a relational database even if your schema changes or is unknown at design time. However, they also have drawbacks: it's hard to validate the data at the database level (data type and nullability), it's hard to make meaningful links to other tables using foreign key relationships, and querying the data can become very complicated - imagine finding all records where the status is 1 and the facebook_id is null and the registration date is greater than yesterday.
Given that you appear to know the schema of your data, I'd say "key/value" is not a good choice.
I would advise running some tests. Try it both ways and benchmark it. Nobody will be able to give you a definitive answer because you have not shared your hardware configuration, sample data, sample queries, how you plan on using the data, etc. Here is some information that you may want to consider.
Use The Database as it was intended
A relational database is designed specifically to handle data. Use it as such. When written correctly, joining data in a well written schema will perform well. You can use EXPLAIN to optimize queries. You can log SLOW queries and improve their performance. Databases have been around for years, if putting everything into a single table improved performance, don't you think that would be all the buzz on the internet and everyone would be doing it?
Engine Types
How will inserts be affected as the row count grows? Are you using MyISAM or InnoDB? You will most likely want to use InnoDB so you get row level locking and not table. Make sure you are using the correct Engine type for your tables. Get the information you need to understand the pros and cons of both. The wrong engine type can kill performance.
Enhancing Performance using Partitions
Find ways to enhance performance. For example, as your datasets grow you could partition the data. Data partitioning will improve the performance of a large dataset by keeping slices of the data in separate partitions, allowing you to run queries on parts of large datasets instead of all of the information.
Use correct column types
Consider using UUID Primary Keys for portability and future growth. If you use proper column types, it will improve performance of your data.
Do not serialize data
Using serialized data is the worst way to go. When you use serialized fields, you are basically using the database as a file management system. It will save and retrieve the "file", but then your code will be responsible for unserializing, searching, sorting, etc. I just spent a year trying to unravel a mess like that. It's not what a database was intended to be used for. Anyone advising you to do that is not only giving you bad advice, they do not know what they are doing. There are very few circumstances where you would use serialized data in a database.
Conclusion
In the end, you have to make the final decision. Just make sure you are well informed and educated on the pros and cons of how you store data. The last piece of advice I would give is to find out what heavy users of mysql are doing. Do you think they store data in a single table? Or do they build a relational model and use it the way it was designed to be used?
When you say "I am going to put everything into a single table", you are saying that you know more about performance and can make better choices for optimization in your code than the team of developers that constantly work on MySQL to make it what it is today. Consider weighing your knowledge against the cumulative knowledge of the MySQL team and the DBAs, companies, and members of the database community who use it every day.
At a certain point you should look at the "short row model", also known as entity-key-value stores, as well as the traditional "long row model".
If you look at the schema used by WordPress you will see that there is a table wp_posts with 23 columns and a related table wp_postmeta with 4 columns (meta_id, post_id, meta_key, meta_value). The meta table is a "short row model" table that allows WordPress to have an unbounded collection of attributes for a post.
Neither the "long row model" nor the "short row model" is the best model; often the best choice is a combination of the two. As @nevillek pointed out, searching and validating "short row" data is not easy, and fetching data can involve pivoting, which is annoyingly difficult in MySQL and Oracle.
The "long row model" is easier to validate, relate and fetch, but it can be very inflexible and inefficient when the data is sparse. Some rows may have only a few of the values non-null. Also you can't add new columns without modifying the schema, which could force a system outage, depending on your architecture.
I recently worked on a financial services system that had over 700 possible facts for each instrument, though most had fewer than 20 facts. This could have been built by setting up dozens of tables, each for a particular asset class, or as a table with 700 columns, but we chose to use a combination of a table with about 20 columns containing the most popular facts and a 4-column table which contained the other facts. This design was efficient but was difficult to access, so we built a few table functions in PL/SQL to assist with this.
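A small sketch of that combined layout: a "long row" table for the common facts plus a "short row" table for the rare ones, merged by a helper at read time. This is Python with SQLite rather than the PL/SQL table functions mentioned above, and the instrument and instrument_fact tables, their columns, and load_instrument are all hypothetical.

```python
import sqlite3

COMMON_COLUMNS = ("id", "name", "currency")  # stand-in for the ~20 popular facts

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE instrument (
    id       INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    currency TEXT NOT NULL
);
CREATE TABLE instrument_fact (
    fact_id       INTEGER PRIMARY KEY,
    instrument_id INTEGER NOT NULL REFERENCES instrument(id),
    fact_key      TEXT NOT NULL,
    fact_value    TEXT,
    UNIQUE (instrument_id, fact_key)
);
""")
conn.execute("INSERT INTO instrument VALUES (1, 'ACME 5Y Bond', 'USD')")
conn.executemany(
    "INSERT INTO instrument_fact (instrument_id, fact_key, fact_value) VALUES (?, ?, ?)",
    [(1, "coupon_rate", "4.25"), (1, "callable", "no")],
)

def load_instrument(conn, instrument_id):
    """Merge the fixed columns and the key/value facts into one dictionary."""
    row = conn.execute(
        f"SELECT {', '.join(COMMON_COLUMNS)} FROM instrument WHERE id = ?",
        (instrument_id,),
    ).fetchone()
    if row is None:
        return None
    record = dict(zip(COMMON_COLUMNS, row))
    # Rare facts are pivoted in application code instead of in SQL.
    record.update(
        conn.execute(
            "SELECT fact_key, fact_value FROM instrument_fact WHERE instrument_id = ?",
            (instrument_id,),
        ).fetchall()
    )
    return record

bond = load_instrument(conn, 1)
missing = load_instrument(conn, 99)
```

Doing the pivot in the application keeps the SQL trivial, at the cost that the rare facts come back as untyped text.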
I have a general comment for you,
Think about it: if you put more than 10-12 columns in a table, even if it makes sense to put them there, I guess you are going to pay the price in the short, medium, and long term.
Your 3 tables approach seems to be better than the 1 table approach, but consider making those into 5-6 tables rather than 3 tables because you still can.
Move currently, currently_position, currently_link from user-table and work from user-profile into a new table with your primary key called USERWORKPROFILE.
Move locale Information from user-profile to a newer USERPROFILELOCALE information because it is generic in nature.
And yes, all your generic attributes in all the tables should be int and not varchar.
For instance, City needs to move out to a new table called LIST_OF_CITIES with cityid.
And your attribute city should change from varchar to int and point to cityid in LIST_OF_CITIES.
Do not worry about performance issues; the more tables you have, better the performance, because you are actually handing out the performance to the database provider instead of taking it all in your own hands.

Which is more efficient: Multiple MySQL tables or one large table?

I store various user details in my MySQL database. Originally it was set up in various tables meaning data is linked with UserIds and outputting via sometimes complicated calls to display and manipulate the data as required. Setting up a new system, it almost makes sense to combine all of these tables into one big table of related content.
Is this going to be a help or hindrance?
Speed considerations in calling, updating or searching/manipulating?
Here's an example of some of my table structure(s):
users - UserId, username, email, encrypted password, registration date, ip
user_details - cookie data, name, address, contact details, affiliation, demographic data
user_activity - contributions, last online, last viewing
user_settings - profile display settings
user_interests - advertising targetable variables
user_levels - access rights
user_stats - hits, tallies
Edit: I've upvoted all answers so far, they all have elements that essentially answer my question.
Most of the tables have a 1:1 relationship which was the main reason for denormalising them.
Are there going to be issues if the table spans across 100+ columns when a large portion of these cells are likely to remain empty?
Multiple tables help in the following ways / cases:
(a) if different people are going to be developing applications involving different tables, it makes sense to split them.
(b) If you want to give different kind of authorities to different people for different part of the data collection, it may be more convenient to split them. (Of course, you can look at defining views and giving authorization on them appropriately).
(c) For moving data to different places, especially during development, it may make sense to use tables resulting in smaller file sizes.
(d) Smaller foot print may give comfort while you develop applications on specific data collection of a single entity.
(e) It is a possibility: what you thought as a single value data may turn out to be really multiple values in future. e.g. credit limit is a single value field as of now. But tomorrow, you may decide to change the values as (date from, date to, credit value). Split tables might come handy now.
My vote would be for multiple tables - with data appropriately split.
Good luck.
Combining the tables is called denormalizing.
It may (or may not) help to make some queries (which make lots of JOINs) to run faster at the expense of creating a maintenance hell.
Before version 8.0.18 added hash joins, MySQL was capable of using only one JOIN method, namely NESTED LOOPS.
This means that for each record in the driving table, MySQL locates a matching record in the driven table in a loop.
Locating a record is quite a costly operation which may take dozens of times as long as pure record scanning.
Moving all your records into one table will help you to get rid of this operation, but the table itself grows larger, and the table scan takes longer.
If you have lots of records in other tables, then the increase in table scan time can outweigh the benefits of the records being scanned sequentially.
Maintenance hell, on the other hand, is guaranteed.
Are all of them 1:1 relationships? I mean, if a user could belong to, say, different user levels, or if the users interests are represented as several records in the user interests table, then merging those tables would be out of the question immediately.
Regarding previous answers about normalization, it must be said that the database normalization rules completely disregard performance and only look at what makes a neat database design. That is often what you want to achieve, but there are times when it makes sense to actively denormalize in pursuit of performance.
All in all, I'd say the question comes down to how many fields there are in the tables, and how often they are accessed. If user activity is often not very interesting, then it might just be a nuisance to always have it on the same record, for performance and maintenance reasons. If some data, like settings, say, are accessed very often, but simply contains too many fields, it might also not be convenient to merge the tables. If you're only interested in the performance gain, you might consider other approaches, such as keeping the settings separate, but saving them in a session variable of their own so that you don't have to query the database for them very often.
Do all of those tables have a 1-to-1 relationship? For example, will each user row only have one corresponding row in user_stats or user_levels? If so, it might make sense to combine them into one table. If the relationship is not 1 to 1 though, it probably wouldn't make sense to combine (denormalize) them.
Having them in separate tables vs. one table is probably going to have little effect on performance though unless you have hundreds of thousands or millions of user records. The only real gain you'll get is from simplifying your queries by combining them.
ETA:
If your concern is about having too many columns, then think about what stuff you typically use together and combine those, leaving the rest in a separate table (or several separate tables if needed).
If you look at the way you use the data, my guess is that you'll find that something like 80% of your queries use 20% of that data with the remaining 80% of the data being used only occasionally. Combine that frequently used 20% into one table, and leave the 80% that you don't often use in separate tables and you'll probably have a good compromise.
Creating one massive table goes against relational database principles. I wouldn't combine them all into one table. You're going to get multiple instances of repeated data. If your user has three interests, for example, you will have 3 rows with the same user data in each, just to store the three different interests. Definitely go for the multiple 'normalized' table approach. See this Wiki page for database normalization.
Edit:
I have updated my answer, as you have updated your question... I agree with my initial answer even more now since...
"a large portion of these cells are likely to remain empty"
If, for example, a user didn't have any interests, then if you normalize you simply won't have a row in the interest table for that user. If you have everything in one massive table, then you will have columns (and apparently a lot of them) that contain just NULLs.
I have worked for a telephony company where there were tons of tables, and getting data could require many joins. When the performance of reading from these tables was critical, procedures were created that could generate a flat table (i.e. a denormalized table) that required no joins, calculations, etc., which reports could point to. These were then used in conjunction with a SQL Server Agent to run the job at certain intervals (i.e. a weekly view of some stats would run once a week and so on).
Why not use the same approach Wordpress does by having a users table with basic user information that everyone has and then adding a "user_meta" table that can basically be any key, value pair associated with the user id. So if you need to find all the meta information for the user you could just add that to your query. You would also not always have to add the extra query if not needed for things like logging in. The benefit to this approach also leaves your table open to adding new features to your users such as storing their twitter handle or each individual interest. You also won't have to deal with a maze of associated ID's because you have one table that rules all metadata and you will limit it to only one association instead of 50.
Wordpress specifically does this to allow for features to be added via plugins, therefore allowing for your project to be more scalable and will not require a complete database overhaul if you need to add a new feature.
I think this is one of those "it depends" situation. Having multiple tables is cleaner and probably theoretically better. But when you have to join 6-7 tables to get information about a single user, you might start to rethink that approach.
I would say it depends on what the other tables really mean.
Does user_details contain more than one row per user, and so on?
What level of normalization is best suited for your needs depends on your demands.
If you have one table with a good index, that would probably be faster. But on the other hand it would probably be more difficult to maintain.
To me it looks like you could skip User_Details, as it is probably in a 1-to-1 relation with Users.
But the rest probably have a lot of rows per user?
Performance considerations on big tables
"Likes" and "views" (etc) are one of the very few valid cases for 1:1 relationship _for performance. This keeps the very frequent UPDATE ... +1 from interfering with other activity and vice versa.
Bottom line: separate frequent counters in very big and busy tables.
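As a sketch of the counter split, the frequently updated numbers live in their own 1:1 table so the UPDATE ... +1 never rewrites the wide row. SQLite (Python standard library) stands in for MySQL, and the post and post_counters tables and the record_view helper are assumed names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE post (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    body  TEXT NOT NULL
);
-- 1:1 with post; holds only the columns that change constantly.
CREATE TABLE post_counters (
    post_id INTEGER PRIMARY KEY REFERENCES post(id),
    views   INTEGER NOT NULL DEFAULT 0,
    likes   INTEGER NOT NULL DEFAULT 0
);
""")
conn.execute("INSERT INTO post VALUES (1, 'Hello', 'First post')")
conn.execute("INSERT INTO post_counters (post_id) VALUES (1)")

def record_view(conn, post_id):
    """Bump the view counter without touching the post row itself."""
    conn.execute(
        "UPDATE post_counters SET views = views + 1 WHERE post_id = ?", (post_id,)
    )

for _ in range(3):
    record_view(conn, 1)

views, likes = conn.execute(
    "SELECT views, likes FROM post_counters WHERE post_id = 1"
).fetchone()
```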
Another possible case is where you have a group of columns that are rarely present. Rather than having a bunch of nulls, have a separate table that is related 1:1, or more aptly phrased "1:rarely". Then use LEFT JOIN only when you need those columns. And use COALESCE() when you need to turn NULL into 0.
Bottom Line: It depends.
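The "1:rarely" case can be sketched as follows, with LEFT JOIN pulling in the optional columns only when needed and COALESCE() turning a missing value into 0. SQLite (Python standard library) stands in for MySQL; users, user_extra, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id       INTEGER PRIMARY KEY,
    username TEXT NOT NULL
);
-- Only users who actually have these values get a row here.
CREATE TABLE user_extra (
    user_id        INTEGER PRIMARY KEY REFERENCES users(id),
    hometown       TEXT,
    loyalty_points INTEGER
);
INSERT INTO users VALUES (1, 'ann'), (2, 'bob');
INSERT INTO user_extra VALUES (1, 'Oslo', 120);
""")

rows = conn.execute("""
    SELECT u.username, e.hometown, COALESCE(e.loyalty_points, 0)
    FROM users AS u
    LEFT JOIN user_extra AS e ON e.user_id = u.id
    ORDER BY u.id
""").fetchall()
```

Here bob has no user_extra row, so his hometown comes back as NULL and his points as 0.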
Limit search conditions to one table. An INDEX cannot reference columns in different tables, so a WHERE clause that filters on multiple columns might use an index on one table, but then have to work harder to continue the filtering columns in other tables. This issue is especially bad if "ranges" are involved.
Bottom line: Don't move such columns into a separate table.
TEXT and BLOB columns can be bulky, and this can cause performance issues, especially if you unnecessarily say SELECT *. Such columns are stored "off-record" (in InnoDB). This means that the extra cost of fetching them may involve an extra disk hit(s).
Bottom line: InnoDB is already taking care of this performance 'problem'.