I'm designing a large-scale website. To prevent duplicate voting, I'm going to store users' votes (user_id, post_id) in the database (I'm using MySQL now).
Which one is better?
1- Have just one row per user and store all the post IDs they've voted on in a text field there.
2- Have one row per vote, with the IDs stored as integers.
Thanks
A few notes:
The text-based approach has the potential to overflow. You're going to have to set a maximum size for that text field, and even then, a high enough number of votes will exceed capacity and cause your code to fail.
The text-based approach has the potential to be very slow. Suppose I've voted for 100,000 posts. Checking whether I've voted for Post X will involve downloading those 100,000 post IDs, parsing them into an array, and checking the array for Post X's ID. That's going to be far slower than an indexed query of SELECT 1 FROM votes WHERE user_id = X AND post_id = Y LIMIT 1;, which will always run at almost exactly the same speed: pretty darn fast if it's indexed. (If not, it'll essentially do the same thing as your text-based approach and be super-slow, so indexing will be very important here!) Plus, note that if you go with MySQL's LONGTEXT to avoid issue #1, you stand to transfer up to 4GB of data each time you want to check for a single vote. Eww.
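As a sketch of that index (assuming a votes table with user_id and post_id columns; the names here are just placeholders):

CREATE INDEX idx_votes_user_post ON votes (user_id, post_id);

-- With the index in place, this is a quick B-tree lookup no matter how many votes exist:
SELECT 1 FROM votes WHERE user_id = 123 AND post_id = 456 LIMIT 1;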
In my experience, your row-per-vote approach will actually be simpler to implement (especially once you get comfortable with SQL), will scale better, and will have many fewer ways in which it could break. There are scales at which relational databases become infeasible, but, for almost all users, using SQL to its full potential is the best way to get great performance.
The correct way to do this is your second option. Relational databases are good at this stuff.
Make a table called uservotes or something, with a composite primary key of (userid, postid). That automatically prevents duplicate votes from being added. It also means you can do:
SELECT SUM(vote) FROM uservotes WHERE postid=42;
Not that you would do that... You'd probably just store the total vote on the post itself.
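For illustration, a minimal sketch of that table (names are placeholders, and the vote column assumes +1/-1 style voting):

CREATE TABLE uservotes (
    userid INT NOT NULL,
    postid INT NOT NULL,
    vote   TINYINT NOT NULL,        -- e.g. +1 for an upvote, -1 for a downvote
    PRIMARY KEY (userid, postid)    -- one vote per user per post
);

-- A second vote by the same user on the same post is rejected outright:
INSERT INTO uservotes (userid, postid, vote) VALUES (1, 42, 1);
INSERT INTO uservotes (userid, postid, vote) VALUES (1, 42, 1);  -- fails: duplicate entry for the primary key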
I'm building a training-point management feature, and I need to save all those points in the database so they can be displayed when needed. I created a table for that feature with 60 columns. Is that good? Or can anyone suggest another way to handle it?
It is unusual, but not impossible, for a table to have that many columns. However...
It suggests that your schema might not be normalized. If that is the case, then you will run into problems designing queries and/or making efficient use of the available resources.
Depending on how often each row is updated, the table could become fragmented. MySQL, like most DBMSs, does not add up the sizes of all the attributes in the relation to work out the space to allocate for the record (although this is an option with C-ISAM). It rounds that figure up so that there is some room for the data to grow, but at some point the record could become larger than the space available. At that point the record must be migrated elsewhere, which leads to fragmentation of the data.
Your queries are going to be very difficult to read and maintain. You may fall into the trap of writing "SELECT * ...", which means that the DBMS needs to read the entire record into memory in order to resolve the query. This does not make for efficient use of your memory.
We can't tell you whether what you have done is correct, nor whether you should be doing it differently, without a detailed understanding of the underlying data.
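As a purely hypothetical sketch (the actual columns weren't described): if the 60 columns are really a repeating group such as point_1 ... point_60, the normalized alternative is one row per point:

CREATE TABLE training_points (
    student_id   INT NOT NULL,
    criterion_id INT NOT NULL,   -- which of the 60 items this score belongs to
    points       INT NOT NULL,
    PRIMARY KEY (student_id, criterion_id)
);

-- Display all points for one student:
SELECT criterion_id, points FROM training_points WHERE student_id = 7;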
I've worked with many tables that had dozens of columns. It's usually not a problem.
In relational database theory, there is no limit to the number of columns in a table, as long as it's finite. If you need 60 attributes and they are all properly attributes of the candidate key in that table, then it's appropriate to make 60 columns.
It is possible that some of your 60 columns are not proper attributes of the table, and need to be split into multiple tables for the sake of normalization. But you haven't described enough about your specific table or its columns, so we can't offer opinions on that.
There's a practical limit in MySQL for how many columns it supports in a given table, but this is a limit of the implementation (i.e. MySQL internal code), not of the theoretical data model. The actual maximum number of columns in a table is a bit tricky to define, since it depends on the specific table. But it's almost always greater than 60. Read this blog about Understanding the Maximum Number of Columns in a MySQL Table for details.
I have a table with 96 columns. The problem is that I get confused creating a table with such a large number of columns.
Don't do that, then!
It's rare to genuinely need a table with that many columns. Most likely, you will be able to split the data across multiple tables in a relational database. For example, if, in your long table, each record contains the name of a product, the price of the product, the store that sells the product, and the address of the store, you will usually want to have separate Stores and Products tables, probably with a many-to-many relationship between them.
To a large extent you can do so without much thought, by putting your database into some normal form, typically the third normal form. These normal forms are chosen to have nice properties when you want to insert, update, or delete a record. However, you usually have to think about the meaning of the data you store to find a decomposition that makes sense. A lack of repetitions in the initial data doesn't mean there won't be any later.
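A sketch of that example decomposition (all names hypothetical):

CREATE TABLE products (
    product_id INT PRIMARY KEY,
    name       VARCHAR(100) NOT NULL,
    price      DECIMAL(10, 2) NOT NULL
);

CREATE TABLE stores (
    store_id INT PRIMARY KEY,
    name     VARCHAR(100) NOT NULL,
    address  VARCHAR(200) NOT NULL
);

-- Junction table for the many-to-many relationship:
CREATE TABLE store_products (
    store_id   INT NOT NULL,
    product_id INT NOT NULL,
    PRIMARY KEY (store_id, product_id),
    FOREIGN KEY (store_id) REFERENCES stores (store_id),
    FOREIGN KEY (product_id) REFERENCES products (product_id)
);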
Read more
Those concepts are well explained in the Manga Guide to Databases.
This answer gives an example of a situation that requires partitioning, and another answer by the same user explains the performance benefits. (Besides not confusing oneself.)
But I need to!
In some odd situations, you might genuinely need a long table. Maybe you're starting a club for people who have exactly 95 names and so you need to store an identifier key (since there is no natural primary key in this case) and each of the names in order. In that case, you will have some test data you can use to immediately verify that the table has the correct format.
To avoid getting confused, it might help to use pen and paper (or a blackboard): write out the test data in the order that's most natural, find a reasonable name and format for each column, and then work off that when writing your table creation procedure. The line numbers in your editor should be enough to make sure you haven't skipped a column.
I'm trying to build an activity stream which has the following structure:
------------------------------------------------------------------------------------
id | activity_by_user_id | activity_by_username | ... other activity related columns
------------------------------------------------------------------------------------
Is it a good approach to store activity_by_username in the activity table too? I understand that this will clutter up the table with the same username again and again. But if not, I will have to do a join with the users table to fetch the username.
The username in my web application never changes.
With this, I will no longer have to join this table with the users table. Is this an optimal way of achieving what I need?
What you are proposing is to denormalize the data structure. There are advantages and disadvantages to this approach.
Clearly, you think that performance will be an advantage, because you will not need to look up the username for each row. This may not be true. The lookup should be on the primary key of the table and should be quite fast. There are even situations where storing the redundant information could slow down the query. This occurs when the field size is large and there are many rows for the same user. Then you are wasting lots of storage on redundant data, increasing the size of the table. Normally, though, you would expect to see a modest -- very modest -- improvement in performance.
Balanced against that is the fact that you are storing redundant data. So, if the user name were updated, then you would have to change lots of rows with the new information.
On balance, I would only advise you to go with such an approach if you tested it on real data in your environment and the performance improvement is worth it. I am skeptical that you would see much improvement, but the proof is in the pudding.
By the way, there are cases where denormalized data structures are needed to support applications. I don't think that looking up a field using a primary key is likely to be one of them.
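For reference, the lookup in question is just a primary-key join, roughly like this (column names assumed from the question's structure):

SELECT a.id, u.username   -- plus the other activity-related columns
FROM activity AS a
JOIN users AS u ON u.id = a.activity_by_user_id;

-- EXPLAIN should show join type eq_ref for users: one indexed row lookup per activity row.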
There isn't a single answer to your question*
In general, relational database design seeks to avoid redundancy to limit the opportunities for data anomalies. For example, you now have the chance that two given rows might contain the same user id but different user names. Which one is correct? How do you prevent such discrepancies?
On the other hand, denormalization by storing certain columns redundantly is sometimes justified. You're right that you avoid doing a join because of that. But now it's your responsibility to make sure data anomalies don't creep in.
And was it really worth it? In MySQL, doing a join to look up a related row by its primary key is pretty efficient (you see this as a join type "eq_ref" in EXPLAIN). I wouldn't try to solve that problem until you can prove it's a bottleneck.
Basically, denormalization optimizes one type of query, at the expense of other types of queries. The extra work you do to prevent, detect, and correct data anomalies may be greater than any efficiency you gain by avoiding the join in this case. Or if usernames were to change sometimes, you'd have to change them in two places now (I know you said usernames don't change in your app).
The point is it depends entirely on your how frequently different queries are run by your application, so it's not something anyone can answer for you.
* That might explain why some people are downvoting your question -- some people on Stack Overflow seem to have a rather strict idea about what a "valid" question is. I have seen questions closed or even deleted because they are too subjective and opinion-based. But I have also seen questions deleted because the answer is too "obvious". One of my answers with 100 upvotes was lost because a moderator thought the question "Do I really need version control if I work solo?" was invalid. Go figure. I copied that one to my blog here.
I think it is a bad idea. Databases are optimized for joins (assuming you did your job and indexed correctly), and denormalized data is notoriously hard to maintain. There may be no username changes now, but can you guarantee that for the future? No. Risking your data integrity on such a thing is short-sighted at best.
Only denormalize in rare cases where there is an existing performance problem and other optimization techniques have failed to improve the situation. Denormalizing isn't even always going to get you a performance improvement; as tables get wider, it may even slow performance down. So don't do it unless you have a measurable performance problem, and then measure to ensure the denormalization actually helps. It is the last optimization technique to try out of all of them, so if you haven't first gone through the very large list of other optimization possibilities, denormalization should not be an option.
No. This goes against all principles of data normalization.
And it won't even be that difficult (if I'm interpreting what you mean by id, user_id, and user_name correctly): id will be the primary key tying everything together, and the linchpin of your JOINs. So you'll have your main table with id and the activity columns (not sure what you mean by activity), then a second table with just id and user_id, and a third with id and username. And when you want to output whatever you're going to output, by user_id or username, you'll just JOIN (implied join syntax: WHERE table1.id = table2.id).
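A quick sketch of that implied (comma) join syntax, simplified to two hypothetical tables:

SELECT a.id, u.username
FROM activity AS a, users AS u          -- implied join: tables listed with a comma
WHERE u.id = a.activity_by_user_id;     -- join condition lives in the WHERE clause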
Okay, so I have my user table ready with columns for all the technical information, such as username, profile picture, password, and so on. Now I'm at the point where I need to add superficial profile information, such as location, age, self-description, website, Facebook account, Twitter account, interests, etc. In total, I calculated this would amount to 12 new columns, and since my user table already has 18 columns, I'm at a crossroads. Other questions I read about this didn't really give a bottom-line answer about which method is most efficient.
I need to find out whether there is a more efficient way, and what the most efficient way is to store this kind of information. The base assumption is that my website will in the future have millions of users, so an option is needed that is able to scale.
I have so far concluded two different options:
Option 1: Store superficial data in user table, taking the total column count in users table up to 30.
Or
Option 2: Store superficial data in separate table, connecting that with Users table.
Which of these has better ability to scale? Which is more efficient? Is there a third option that is better than these two?
A special extra question also, if anyone has information about this; how do the biggest sites on the internet handle this? Thanks to anyone who participates with an answer, it is hugely appreciated.
My current database is MySQL with the mysql2 gem in Rails 4.
In your case, I would go with the second option. I suppose this would be more efficient because you would retrieve data from table 1 whenever the user logs in, and you would use data from table 2 (the superficial data) whenever you change his preferences. You would not have to retrieve all the data each time you want to do something. Bottom line, I would suggest modelling your data according to your usage scenarios (use cases), creating data entities (e.g. tables) matching your use-case entities. Then you should take into account the database normalization principles.
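A minimal sketch of that two-table layout, assuming a one-to-one profile table (all names hypothetical):

CREATE TABLE users (
    id            INT PRIMARY KEY AUTO_INCREMENT,
    username      VARCHAR(50) NOT NULL UNIQUE,
    password_hash CHAR(60) NOT NULL
    -- ... the other technical columns
);

CREATE TABLE user_profiles (
    user_id  INT PRIMARY KEY,            -- 1:1 with users
    location VARCHAR(100),
    website  VARCHAR(200),
    -- ... the other superficial columns
    FOREIGN KEY (user_id) REFERENCES users (id)
);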
If you are interested in how these issues are handled by the biggest sites in the world, you should know that they do not use relational (SQL) databases. They actually use NoSQL databases, which run in a distributed fashion. This is a much more complicated scenario than yours. If you want to see related tools, you could start reading about Cassandra and Hadoop.
Hope I helped!
If you will need to access these 30 columns of information frequently, you could put all of them into the same table. That's what some widely used CMSes do, because even though a row is big, it's faster to retrieve one big row than plenty of small rows from various tables (more SQL requests, more searches, more indexes, ...).
Also a good read for your problem is Database normalization.
I'm doing something different, but this is an easier example to understand. Think of the votes here. I add these votes to a separate table and log information about them, like by whom, when, and so on. Would you also add a field to the main table that simply counts the number of votes, or is this bad practice?
This is called "denormalization" and is considered bad practice unless you get a significant performance boost when you denormalize.
The biggest issue with this, however, is concurrency. What happens if two people vote on the poll and they both try to increment the VoteCount column?
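If you do store the counter, the increment at least needs to be atomic. A sketch (table and column names hypothetical):

-- Safe: a single atomic statement; concurrent voters won't lose updates.
UPDATE polls SET vote_count = vote_count + 1 WHERE id = 42;

-- Unsafe: a read-modify-write from application code can lose concurrent votes,
-- e.g. two clients both read vote_count = 10, then both write vote_count = 11.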
Search for denormalization here and on Google; there has been plenty of discussion on this topic. Find what fits your exact situation best, although, from the looks of it, denormalization would be premature optimization in your situation.
Bad.
Incorrect.
Guaranteed problems and data inconsistencies. The vote count is "derived data" and should not be stored (a duplicate). For stable data (that which does not change), summaries are fair enough.
Now, if the data (the number of votes) is large, and you need to count it often (in queries), then enhance that alone: the speed of counting votes in the vote table, e.g. ensure there is an index on the column being looked up for the count.
If the data is massive, e.g. a bank with millions of transactions per month, and you do not want to count them in order to produce the account balance on every query, enhance that alone. E.g. I calculate a month-to-date figure every night and store it at the account level; the day's figure needs to be counted and added to the MTD figure in order to produce the true up-to-the-minute figure. At the end of the month, when all the auditing processes are changing various rows across that month, the MTD figure (to yesterday) can be recomputed on demand.
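A sketch of the "enhance the count alone" advice (names hypothetical): keep the count derived, but make it cheap to derive.

-- The index lets the count be resolved without scanning the whole table:
CREATE INDEX idx_votes_post ON votes (post_id);

SELECT COUNT(*) FROM votes WHERE post_id = 42;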
The short answer is YES. But you should keep in mind that the duplication may become a problem, or even a nightmare, for your system's development and maintenance. If you want to store some pre-calculated cache values to improve performance, the calculation of the cache should be encapsulated and transparent to other processes.
In this case:
Solution 1: When a user votes on the poll, the detailed information is recorded, and the vote count is increased by one automatically (i.e. the cache calculation is encapsulated in the data-writer process); see the sketch below these two solutions.
Solution 2: When the vote information is recorded, nothing is done to the vote count; only a flag is changed to mark the vote count value as dirty. When the vote count is read, if its value is dirty, calculate it and update both the value and the flag; if its value is current (not dirty), read it directly (i.e. the cache calculation is encapsulated in the data-reader process).
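A minimal sketch of Solution 1 as a MySQL trigger (all names hypothetical):

-- Keeps the cached counter in step with the detail table automatically:
CREATE TRIGGER votes_after_insert
AFTER INSERT ON votes
FOR EACH ROW
UPDATE polls SET vote_count = vote_count + 1 WHERE id = NEW.poll_id;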
Read Section 7 of the famous book The Pragmatic Programmer, you may get some ideas.
Actually, the Normal Forms used in database design are a special case of the DRY principle.
In short, NO. There is no point in storing data that can be fetched with a COUNT query. The second reason is that you would have to manually manipulate the counter value: more work, a bigger chance of problems, and more code/algorithm for you to maintain. Really, do NOT do it; it is bad practice.