Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 2 years ago.
I'm doing something different, but this is an easier example to understand. Think of the votes here. I add these votes to a separate table and log information about them, such as by whom, when, and so on. Would you also add a field to the main table that simply counts the number of votes, or is this bad practice?
This is called "denormalization" and is considered bad practice unless you get a significant performance boost when you denormalize.
The biggest issue with this, however, is concurrency. What happens if two people vote on the poll and they both try to increment the VoteCount column?
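One common way to sidestep the lost-update problem described above is to let the database do the arithmetic in a single UPDATE, inside the same transaction that records the vote. The following is only a minimal sketch: SQLite (through Python's sqlite3) stands in for the real database, and the `polls`/`votes` tables, their columns, and `cast_vote` are hypothetical names.

```python
import sqlite3

# SQLite in memory stands in for the real database; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE polls (id INTEGER PRIMARY KEY,
                        vote_count INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE votes (poll_id INTEGER NOT NULL,
                        user_id INTEGER NOT NULL,
                        voted_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
                        PRIMARY KEY (poll_id, user_id));
    INSERT INTO polls (id) VALUES (1);
""")

def cast_vote(conn, poll_id, user_id):
    # One transaction: the vote row and the counter change commit together or not at all.
    with conn:
        conn.execute("INSERT INTO votes (poll_id, user_id) VALUES (?, ?)",
                     (poll_id, user_id))
        # The increment is computed by the database, not read into the
        # application first, so two voters cannot overwrite each other's update.
        conn.execute("UPDATE polls SET vote_count = vote_count + 1 WHERE id = ?",
                     (poll_id,))

cast_vote(conn, 1, 100)
cast_vote(conn, 1, 101)
vote_count = conn.execute("SELECT vote_count FROM polls WHERE id = 1").fetchone()[0]
actual = conn.execute("SELECT COUNT(*) FROM votes WHERE poll_id = 1").fetchone()[0]
print(vote_count, actual)
```

The risky pattern is reading `vote_count` into the application, adding one, and writing the result back; the relative update avoids that read-modify-write gap.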
Search for denormalization here and on Google; there has been plenty of discussion on this topic. Find what fits your exact situation best, although, from the looks of it, denormalization would be premature optimization in your situation.
Bad.
Incorrect.
Guaranteed problems and data inconsistencies. The vote count is "derived data" and should not be stored (a duplicate). For stable data (that which does not change), summaries are fair enough.
Now if the data (number of votes) is large and you need to count it often (in queries), then improve that alone: the speed of looking up the vote table from the main table, e.g. ensure there is an index on the column being looked up for the count.
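As a rough illustration of counting on demand with an index instead of storing the total, here is a small sketch. SQLite via Python's sqlite3 is used only as a stand-in, and the `votes` table, the `idx_votes_poll_id` index, and `count_votes` are assumed names.

```python
import sqlite3

# SQLite in memory stands in for the real database; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE votes (id INTEGER PRIMARY KEY,
                        poll_id INTEGER NOT NULL,
                        user_id INTEGER NOT NULL);
    -- Index on the column used to look up votes for one poll.
    CREATE INDEX idx_votes_poll_id ON votes (poll_id);
""")
conn.executemany("INSERT INTO votes (poll_id, user_id) VALUES (?, ?)",
                 [(1, 10), (1, 11), (2, 10), (1, 12)])

def count_votes(conn, poll_id):
    # The count is derived from the vote rows every time; nothing is stored twice.
    return conn.execute("SELECT COUNT(*) FROM votes WHERE poll_id = ?",
                        (poll_id,)).fetchone()[0]

poll_1_votes = count_votes(conn, 1)
poll_2_votes = count_votes(conn, 2)
print(poll_1_votes, poll_2_votes)
```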
If the data is massive, e.g. a bank with millions of transactions per month, and you do not want to count them all in order to produce the account balance on every query, enhance that alone. E.g. I calculate a month-to-date (MTD) figure every night and store it at the account level; the current day's figure needs to be counted and added to the MTD figure in order to produce the true up-to-the-minute figure. At the end of the month, when all the auditing processes are changing various rows across the month, the MTD figure (to yesterday) can be recalculated on demand.
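The month-to-date approach might look something like the sketch below. It is an illustration under stated assumptions, not the author's actual system: SQLite stands in for the bank's database, amounts are stored as integer cents, and `transactions`, `account_mtd`, `nightly_rollup`, and `current_month_total` are hypothetical names.

```python
import sqlite3

# SQLite in memory stands in for the real database; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (id INTEGER PRIMARY KEY,
                               account_id INTEGER NOT NULL,
                               tx_date TEXT NOT NULL,      -- ISO date, e.g. 2024-03-02
                               amount INTEGER NOT NULL);   -- assumed to be in cents
    CREATE INDEX idx_tx_account_date ON transactions (account_id, tx_date);
    CREATE TABLE account_mtd (account_id INTEGER PRIMARY KEY,
                              through_date TEXT NOT NULL,
                              mtd_amount INTEGER NOT NULL);
""")
conn.executemany(
    "INSERT INTO transactions (account_id, tx_date, amount) VALUES (?, ?, ?)",
    [(1, "2024-03-01", 500), (1, "2024-03-02", -200),
     (1, "2024-03-03", 75), (1, "2024-03-03", -25)])

def nightly_rollup(conn, account_id, month_start, through_date):
    # Recompute the stored figure from the detail rows; safe to re-run on demand.
    with conn:
        conn.execute("""
            INSERT OR REPLACE INTO account_mtd (account_id, through_date, mtd_amount)
            SELECT ?, ?, COALESCE(SUM(amount), 0)
            FROM transactions
            WHERE account_id = ? AND tx_date BETWEEN ? AND ?""",
            (account_id, through_date, account_id, month_start, through_date))

def current_month_total(conn, account_id):
    # Stored MTD figure (to yesterday) plus whatever has arrived since.
    through_date, mtd = conn.execute(
        "SELECT through_date, mtd_amount FROM account_mtd WHERE account_id = ?",
        (account_id,)).fetchone()
    today = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM transactions "
        "WHERE account_id = ? AND tx_date > ?",
        (account_id, through_date)).fetchone()[0]
    return mtd + today

nightly_rollup(conn, 1, "2024-03-01", "2024-03-02")
mtd_total = current_month_total(conn, 1)
print(mtd_total)
```

Because the rollup is rebuilt from the detail rows each time, re-running it after audit corrections brings the stored figure back in line with the transactions.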
The short answer is YES. But keep in mind that duplication may become a nuisance, or even a nightmare, for the development and maintenance of your system. If you want to store pre-calculated cache values to improve performance, the cache calculation should be encapsulated and transparent to other processes.
In this case:
Solution 1: When a user votes on the poll, the detailed information is recorded and the vote count is increased by one automatically (i.e. the cache calculation is encapsulated in the data-writer process).
Solution 2: When the vote information is recorded, nothing is done to the vote count; only a flag is changed to mark the vote count value as dirty. When the vote count is read, if its value is dirty, calculate it and update both the value and the flag; if its value is current (not dirty), read it directly (i.e. the cache calculation is encapsulated in the data-reader process).
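Solution 2 could be sketched roughly as follows. This is a single-connection illustration only: SQLite stands in for the real database, and the `count_dirty` column and the `record_vote`/`read_vote_count` helpers are hypothetical names.

```python
import sqlite3

# SQLite in memory stands in for the real database; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE polls (id INTEGER PRIMARY KEY,
                        vote_count INTEGER NOT NULL DEFAULT 0,
                        count_dirty INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE votes (poll_id INTEGER NOT NULL,
                        user_id INTEGER NOT NULL,
                        PRIMARY KEY (poll_id, user_id));
    INSERT INTO polls (id) VALUES (1);
""")

def record_vote(conn, poll_id, user_id):
    # Writer side: store the vote and only mark the cached count as stale.
    with conn:
        conn.execute("INSERT INTO votes (poll_id, user_id) VALUES (?, ?)",
                     (poll_id, user_id))
        conn.execute("UPDATE polls SET count_dirty = 1 WHERE id = ?", (poll_id,))

def read_vote_count(conn, poll_id):
    # Reader side: recompute only when the cached value is marked dirty.
    with conn:
        count, dirty = conn.execute(
            "SELECT vote_count, count_dirty FROM polls WHERE id = ?",
            (poll_id,)).fetchone()
        if dirty:
            count = conn.execute(
                "SELECT COUNT(*) FROM votes WHERE poll_id = ?",
                (poll_id,)).fetchone()[0]
            conn.execute(
                "UPDATE polls SET vote_count = ?, count_dirty = 0 WHERE id = ?",
                (count, poll_id))
    return count

record_vote(conn, 1, 100)
record_vote(conn, 1, 101)
first_read = read_vote_count(conn, 1)
dirty_after_read = conn.execute("SELECT count_dirty FROM polls WHERE id = 1").fetchone()[0]
record_vote(conn, 1, 102)
second_read = read_vote_count(conn, 1)
print(first_read, dirty_after_read, second_read)
```

With several concurrent connections, the recount and the flag reset would additionally need appropriate locking or isolation so that a vote arriving in between is not missed; that part is left out of this sketch.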
Read Section 7 of the famous book The Pragmatic Programmer; you may get some ideas.
Actually, the Normal Forms used in database design are a special case of the DRY principle.
In short, NO. There is no point in storing data that can be fetched with a COUNT query. The second reason is that you have to manipulate the counter value manually: more work, more room for problems, and more code to maintain. Really, do NOT do it; it is bad practice.
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 7 years ago.
I am starting to create the first web application of my career, using MySQL.
I am going to make a table which contains user information (such as id, firstname, lastname, email, password, phone number).
Which of the following is better?
Put all data into one single table (userinfo).
Divide all data by alphabet character and put the data into many tables. For example, if a user's email id is Joe#gmail.com, it goes into table userinfo_j, and if a user's email id is kevin#gmail.com, it goes into table userinfo_k.
I don't want to sound condescending, but I think you should spend some time reading up on database design before tackling this project, especially the concept of normalization, which provides consistent and proven rules for how to store information in a relational database.
In general, my recommendation is to build your database to be easy to maintain and understand first and foremost. On modern hardware, a reasonably well-designed database with indexes running relational queries can support millions of records, often tens or hundreds of millions of records without performance problems.
If your database has a performance problem, tune the query first, add indexes second, and buy better hardware third; if that doesn't work, you may consider a design that makes the application harder to maintain (often called denormalization).
Your second solution will almost certainly be slower for most cases.
Relational databases are really, really fast when searching by indexed fields; searching for "email like 'Joe#gmail.com'" on a reasonable database will be too fast to measure on a database with tens of millions of records.
However, adding the logic to find the right table in which to search will almost certainly be slower than searching a single table that holds all the users.
Especially if you want to search by things other than email address - imagine finding all the users who signed up in the last week. Or who have permission to do a certain thing in your application. Or who have a #gmail.com account.
So, the second solution is bad from a design/maintenance point of view, and will almost certainly be slower.
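To make the single-table option concrete, here is a small sketch of one indexed `userinfo` table answering all three kinds of query mentioned above. SQLite via Python's sqlite3 is only a stand-in for MySQL; the `signed_up` column, the index name, and the sample rows are assumptions added for illustration.

```python
import sqlite3

# SQLite in memory stands in for MySQL; column and index names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE userinfo (id INTEGER PRIMARY KEY,
                           firstname TEXT,
                           lastname TEXT,
                           email TEXT NOT NULL UNIQUE,   -- UNIQUE also creates an index
                           signed_up TEXT NOT NULL);     -- ISO date
    CREATE INDEX idx_userinfo_signed_up ON userinfo (signed_up);
""")
conn.executemany(
    "INSERT INTO userinfo (firstname, lastname, email, signed_up) VALUES (?, ?, ?, ?)",
    [("Joe", "Smith", "joe@gmail.com", "2024-03-01"),
     ("Kevin", "Jones", "kevin@gmail.com", "2024-03-09"),
     ("Ann", "Lee", "ann@example.com", "2024-03-10")])

# Lookup by email: one table, one indexed search, no table-picking logic.
by_email = conn.execute("SELECT firstname FROM userinfo WHERE email = ?",
                        ("kevin@gmail.com",)).fetchone()[0]

# Queries that are unrelated to the first letter of the email still hit one table.
recent = [row[0] for row in conn.execute(
    "SELECT firstname FROM userinfo WHERE signed_up >= ? ORDER BY signed_up",
    ("2024-03-08",))]
gmail_users = conn.execute(
    "SELECT COUNT(*) FROM userinfo WHERE email LIKE '%@gmail.com'").fetchone()[0]
print(by_email, recent, gmail_users)
```

With one table per letter, the last two queries would each have to be repeated across every table and the results merged in application code.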
The first one is better. With the second you will have to write extra logic to find out which table to look in. To speed up searches you can add indexes. Here I suppose you will do equality comparisons more often than less-than or greater-than comparisons, so you can try a hash index. For range comparisons, B-tree indexes are better.
Like others said, the first one is better, especially if you need to add other tables to your database and link them to the user's table; the second one will soon become impossible to work with, and to create relationships for, as your number of tables increases.
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
I'm trying to build an activity stream which has the following structure:
------------------------------------------------------------------------------------
id | activity_by_user_id | activity_by_username | ... other activity related columns
------------------------------------------------------------------------------------
Is it a good approach to store the activity_by_username in the activity table too? I understand that this will clutter up the table with the same username again and again. But if not, I will have to do a join with the users table to fetch the username.
The username in my web application never changes.
With this, I will no longer have to join this table with the users table. Is this an optimal way of achieving what I need?
What you are proposing is to denormalize the data structure. There are advantages and disadvantages to this approach.
Clearly, you think that performance will be an advantage, because you will not need to look up the username on each row. This may not be true. The lookup should be on the primary key of the table and should be quite fast. There are even situations where storing the redundant information could slow down the query. This occurs when the field size is large and there are many apps with the same user. Then you are wasting lots of storage on redundant data, increasing the size of the table. Normally, though, you would expect to see a modest -- very modest -- improvement in performance.
Balanced against that is the fact that you are storing redundant data. So, if the user name were updated, then you would have to change lots of rows with the new information.
On balance, I would only advise you to go with such an approach if you tested it on real data in your environment and the performance improvement is worth it. I am skeptical that you would see much improvement, but the proof is in the pudding.
By the way, there are cases where denormalized data structures are needed to support applications. I don't think that looking up a field using a primary key is likely to be one of them.
There isn't a single answer to your question*
In general, relational database design seeks to avoid redundancy to limit the opportunities for data anomalies. For example, you now have the chance that two given rows might contain the same user id but different user names. Which one is correct? How do you prevent such discrepancies?
On the other hand, denormalization by storing certain columns redundantly is sometimes justified. You're right that you avoid doing a join because of that. But now it's your responsibility to make sure data anomalies don't creep in.
And was it really worth it? In MySQL, doing a join to look up a related row by its primary key is pretty efficient (you see this as a join type "eq_ref" in EXPLAIN). I wouldn't try to solve that problem until you can prove it's a bottleneck.
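For reference, the normalized version of the activity stream with a primary-key join might look like the sketch below. SQLite via Python's sqlite3 is used purely as a stand-in (so the MySQL-specific EXPLAIN output is not reproduced), and the `action` column and sample rows are assumptions.

```python
import sqlite3

# SQLite in memory stands in for MySQL; the action column and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY,
                        username TEXT NOT NULL UNIQUE);
    CREATE TABLE activity (id INTEGER PRIMARY KEY,
                           activity_by_user_id INTEGER NOT NULL REFERENCES users (id),
                           action TEXT NOT NULL);
    INSERT INTO users (id, username) VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO activity (activity_by_user_id, action)
    VALUES (1, 'posted'), (2, 'commented'), (1, 'liked');
""")

# The username lives only in users; each activity row finds it through the
# primary key of users, which is the cheap lookup described above.
stream = conn.execute("""
    SELECT a.id, u.username, a.action
    FROM activity AS a
    JOIN users AS u ON u.id = a.activity_by_user_id
    ORDER BY a.id""").fetchall()
print(stream)
```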
Basically, denormalization optimizes one type of query, at the expense of other types of queries. The extra work you do to prevent, detect, and correct data anomalies may be greater than any efficiency you gain by avoiding the join in this case. Or if usernames were to change sometimes, you'd have to change them in two places now (I know you said usernames don't change in your app).
The point is that it depends entirely on how frequently different queries are run by your application, so it's not something anyone can answer for you.
* That might explain why some people are downvoting your question -- some people in StackOverflow seem to have a rather strict idea about what is a "valid" question. I have seen questions closed or even deleted because they are too subjective and opinion-based. But I have also seen questions deleted because the answer is too "obvious". One of my answers with 100 upvotes was lost because a moderator thought the question of "Do I really need version control if I work solo?" was invalid. Go figure. I copied that one to my blog here.
I think it is a bad idea. Databases are optimized for joins (assuming you did your job and indexed correctly), and denormalized data is notoriously hard to maintain. There may be no username changes now, but can you guarantee that for the future? No. Risking your data integrity on such a thing is short-sighted at best.
Only denormalize in rare cases where there is an existing performance problem and other optimization techniques have failed to improve the situation. Denormalizing isn't even always going to get you any performance improvement. As the tables get wider, it may even slow down performance. So don't do it unless you have a measurable performance problem and you measure and ensure the denormalization actually helps. It is the last optimization technique to try out of all of them, so if you haven't first gone through the very large list of other optimization techniques, denormalization should not be an option.
No. This goes against all principles of data normalization.
And it won't even be that difficult (if I'm interpreting what you mean by id, user_id, and user_name); id will be the primary key tying everything together, and the linchpin of your JOINs. So you'll have your main table with id, other activity col, next activity col, etc. (not sure what you mean by activity), then a second table with just id and user_id, and a third with id and username. And when you want to output whatever you're going to output, by user_id or username, you'll just JOIN (implied join syntax: WHERE table1.id = table2.id).
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
Okay, so I have my user table ready with columns for all the technical information, such as username, profile picture, password and so on. Now I'm in a situation where I need to add superficial profile information, such as location, age, self-description, website, Facebook account, Twitter account, interests etc. In total, I calculated this would amount to 12 new columns, and since my user table already has 18 columns, I am at a crossroads. The other questions I read about this didn't really give a bottom-line answer on which method is most efficient.
I need to find out if there is a more efficient way, and what is the most efficient way to store this kind of information? The base assumption being that my website would in the future have millions of users, so an option is needed that is able to scale.
I have so far come up with two different options:
Option 1: Store superficial data in user table, taking the total column count in users table up to 30.
Or
Option 2: Store superficial data in separate table, connecting that with Users table.
Which of these has better ability to scale? Which is more efficient? Is there a third option that is better than these two?
A special extra question also, if anyone has information about this; how do the biggest sites on the internet handle this? Thanks to anyone who participates with an answer, it is hugely appreciated.
My current database is MySQL with the mysql2 gem in Rails 4.
In your case, I would go with the second option. I suppose this would be more efficient because you would retrieve data from table 1 whenever the user logs in, and you would use data from table 2 (superficial data) whenever the user changes their preferences. You would not have to retrieve all the data each time you want to do something. The bottom line: I would suggest modelling your data according to your usage scenarios (use cases), creating data entities (e.g. tables) that match your use case entities. Then you should take the database normalization principles into account.
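A minimal sketch of the second option, with the profile data in its own one-to-one table, is shown below. SQLite via Python's sqlite3 stands in for MySQL, and `user_profiles`, its columns, and the sample values are hypothetical.

```python
import sqlite3

# SQLite in memory stands in for MySQL; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY,
                        username TEXT NOT NULL UNIQUE,
                        password_hash TEXT NOT NULL);
    -- One row per user at most: the primary key is also the foreign key.
    CREATE TABLE user_profiles (user_id INTEGER PRIMARY KEY REFERENCES users (id),
                                location TEXT,
                                website TEXT,
                                bio TEXT);
    INSERT INTO users (id, username, password_hash)
    VALUES (1, 'alice', 'hash-placeholder');
    INSERT INTO user_profiles (user_id, location, website)
    VALUES (1, 'Helsinki', 'https://example.com');
""")

# Login touches only the narrow users table.
login_row = conn.execute(
    "SELECT id, password_hash FROM users WHERE username = ?",
    ("alice",)).fetchone()

# The profile page joins in the extra columns only when they are needed.
profile_row = conn.execute("""
    SELECT u.username, p.location, p.website
    FROM users AS u
    LEFT JOIN user_profiles AS p ON p.user_id = u.id
    WHERE u.id = ?""", (1,)).fetchone()
print(login_row, profile_row)
```

The LEFT JOIN keeps users without a profile row in the result, with NULLs in the profile columns.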
If you are interested in how these issues are handled by the biggest sites in the world, you should know that many of them do not rely on relational (SQL) databases alone. They also use NoSQL databases, which run in a distributed fashion. This is a much more complicated scenario than yours. If you want to see related tools, you could start reading about Cassandra and Hadoop.
Hope I helped!
If you need to access these 30 columns of information frequently, you could put all of them into the same table. That's what some widely used CMSes do, because even though a row is big, it's faster to retrieve one big row than many small rows from various tables (more SQL requests, more searches, more indexes, ...).
Also a good read for your problem is Database normalization.
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
I'm designing a large-scale website.
To prevent duplicate voting, I'm going to store users' votes (userid, postid) in a database (I'm using MySQL now).
Which one is better?
1- Have just one row for each user and store all the postids they have voted on as a text field there.
2- Have one row for each vote and store it as an integer.
thanks
A few notes:
The text-based approach has the potential to overflow. You're gonna have to set a maximum size for that text field, and, even then, a high enough number of votes will exceed capacity and cause your code to fail.
The text-based approach has the potential to be very slow. Suppose I've voted for 100,000 posts. Checking if I've voted for Post X will involve downloading those 100,000 post IDs, parsing them into an array, and checking the array for Post X's ID. That's gonna be way slower than an indexed query of SELECT 1 FROM votes WHERE user_id = X and post_id = Y LIMIT 1;, which will always run at almost exactly the same speed: pretty darn fast if it's indexed. (If not, it'll essentially do the same thing as your text-based approach and be super-slow, so indexing will be very important here!) Plus, note that if you go with MySQL's LONGTEXT to avoid issue #1, you stand to transfer up to 4GB of data each time you want to check for a single vote. Eww.
In my experience, your row-per-vote approach will actually be simpler to implement (especially once you get comfortable with SQL), will scale better, and will have many fewer ways in which it could break. There are scales at which relational databases become infeasible, but, for almost all users, using SQL to its full potential is the best way to get great performance.
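The indexed check from the notes above could be wrapped up roughly like this. It is a sketch under assumptions: SQLite via Python's sqlite3 stands in for MySQL, and the index name and the `has_voted` helper are hypothetical.

```python
import sqlite3

# SQLite in memory stands in for MySQL; index and helper names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE votes (user_id INTEGER NOT NULL,
                        post_id INTEGER NOT NULL);
    -- Composite index so the check below never has to scan every vote.
    CREATE INDEX idx_votes_user_post ON votes (user_id, post_id);
""")
# One user who has voted on 1,000 posts.
conn.executemany("INSERT INTO votes (user_id, post_id) VALUES (?, ?)",
                 [(7, post_id) for post_id in range(1, 1001)])

def has_voted(conn, user_id, post_id):
    # Only a single row (or nothing) comes back, however many votes the user has cast.
    row = conn.execute(
        "SELECT 1 FROM votes WHERE user_id = ? AND post_id = ? LIMIT 1",
        (user_id, post_id)).fetchone()
    return row is not None

voted_on_500 = has_voted(conn, 7, 500)
voted_on_5000 = has_voted(conn, 7, 5000)
print(voted_on_500, voted_on_5000)
```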
The correct way to do this is your second option. Relational databases are good at this stuff.
Make a table called uservotes or something, with the primary key of userid and postid. That automatically prevents duplicate votes being added. It also means you can do:
SELECT SUM(vote) FROM uservotes WHERE postid=42;
Not that you would do that... You'd probably just store the total vote on the post itself.
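The composite-key idea can be sketched as follows, again with SQLite via Python's sqlite3 standing in for MySQL. The `uservotes` table follows the answer; the `add_vote` helper and the sample votes are hypothetical.

```python
import sqlite3

# SQLite in memory stands in for MySQL; add_vote and the sample data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE uservotes (userid INTEGER NOT NULL,
                            postid INTEGER NOT NULL,
                            vote INTEGER NOT NULL,      -- e.g. +1 or -1
                            PRIMARY KEY (userid, postid));
""")

def add_vote(conn, userid, postid, vote):
    # The primary key rejects a second row for the same (userid, postid) pair.
    try:
        with conn:
            conn.execute(
                "INSERT INTO uservotes (userid, postid, vote) VALUES (?, ?, ?)",
                (userid, postid, vote))
        return True
    except sqlite3.IntegrityError:
        return False

first = add_vote(conn, 1, 42, 1)
duplicate = add_vote(conn, 1, 42, 1)   # same user, same post
second_user = add_vote(conn, 2, 42, 1)
third_user = add_vote(conn, 3, 42, -1)
total = conn.execute(
    "SELECT SUM(vote) FROM uservotes WHERE postid = 42").fetchone()[0]
print(first, duplicate, second_user, third_user, total)
```

Because the database enforces uniqueness, the application does not need a separate "has this user already voted?" check before inserting.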
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
I'm fairly new to structuring databases and I was wondering: say I have 38 different pieces of data that I want to have per record. Is it better to break that up into a couple of different tables, or can I just keep it all in one table?
In this case I have a table of energy usage data for accounts. I have monthly usage, monthly demand, and demand percentage, then 2 identifying keys for each, which comes out to 38 pieces of data for each record.
So is it good practice to break it up, or should I just leave it all as one table? Also, are there any effects on query efficiency once this database ends up accumulating a couple thousand records at its peak?
Edit: I'm using Hibernate to query, I'm not sure if that would have any effect on the efficiency depending on how I end up breaking this data up.
First, check the normal forms:
1) Wiki
2) A Simple Guide to Five Normal Forms in Relational Database Theory
Second, aggregate data like "monthly sales" or "daily clicks" typically goes into separate tables. This is motivated not only by normal forms, but also by the implementation of the database.
For example, MySQL offers the Archive storage engine which is designed for that.
If you're looking at the current month's data, it may appear in the same table or be stored in a cache. The per-month data in a separate table may be computed on the 1st day of the month.
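A per-month summary table computed after the month closes might be sketched like this. SQLite via Python's sqlite3 stands in for MySQL, and `usage_readings`, `monthly_usage`, `rollup_month`, and the sample figures are all hypothetical.

```python
import sqlite3

# SQLite in memory stands in for MySQL; all names and figures are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE usage_readings (account_id INTEGER NOT NULL,
                                 reading_date TEXT NOT NULL,   -- ISO date
                                 kwh INTEGER NOT NULL);
    CREATE TABLE monthly_usage (account_id INTEGER NOT NULL,
                                month TEXT NOT NULL,           -- e.g. 2024-02
                                total_kwh INTEGER NOT NULL,
                                PRIMARY KEY (account_id, month));
""")
conn.executemany(
    "INSERT INTO usage_readings (account_id, reading_date, kwh) VALUES (?, ?, ?)",
    [(1, "2024-02-10", 30), (1, "2024-02-20", 45),
     (2, "2024-02-15", 12), (1, "2024-03-01", 9)])

def rollup_month(conn, month):
    # Intended to run once the month is over (e.g. on the 1st); safe to re-run.
    with conn:
        conn.execute("""
            INSERT OR REPLACE INTO monthly_usage (account_id, month, total_kwh)
            SELECT account_id, ?, SUM(kwh)
            FROM usage_readings
            WHERE substr(reading_date, 1, 7) = ?
            GROUP BY account_id""", (month, month))

rollup_month(conn, "2024-02")
february = conn.execute(
    "SELECT account_id, total_kwh FROM monthly_usage "
    "WHERE month = ? ORDER BY account_id", ("2024-02",)).fetchall()
print(february)
```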
When you read a record, do you often use all the data? Or do you have different sections or masks (loaded separately) to show energy usage data, monthly statistics and so on?
How many records do you plan to have in this table? If they grow dramatically and continually, is it possible to create tables with a postfix to group them by period (month, half year, year, ...)?