A user will be answering random questions (pulled from the DB), and once he answers a question, he should never be asked it again.
So I need a way to remember the questions answered by the application users.
I was thinking about storing the answered questions in a separate table along with the user_id, but given the nature of the app (quick "yes"/"no" questions), every user might end up answering thousands of questions. I'm not sure updating and accessing such a large table on a regular basis is a good solution.
Any other suggestions?
Thousands of answers from one user? Really?
If that's a real possibility, you can store the set of answered questions as a Bloom filter, which can be serialized to a fixed size; see PyBloom for one implementation that can easily be adapted to store the filter state in a BLOB or VARBINARY column. Note that Bloom filters can return false positives, so a few unanswered questions may occasionally be treated as answered; for random quick questions that is usually an acceptable trade-off.
If the users will answer several questions in one session, you'll probably want to keep a copy of the current status in memory, as well as persisting it when it changes.
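For persistence, a minimal MySQL sketch might look like this (the table and column names are illustrative assumptions):

-- One serialized Bloom filter per user; the application (e.g. PyBloom)
-- handles serializing the filter to bytes and back.
CREATE TABLE answered_filters (
    user_id      INT PRIMARY KEY,
    filter_state BLOB NOT NULL
);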
Aren't you going to keep track of the answers? Why can't you just query the existing answers for a user to select an unanswered question? If I were doing this, I'd try the simplest solution first, and only go for the fancy solution when (and if) the simple solution ran out of gas.
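As a sketch of that simple solution (MySQL; the questions/answers table and column names are assumptions):

-- Pick one random question that user 42 has not answered yet.
SELECT q.id, q.text
FROM questions q
LEFT JOIN answers a ON a.question_id = q.id AND a.user_id = 42
WHERE a.question_id IS NULL
ORDER BY RAND()
LIMIT 1;

ORDER BY RAND() is fine at modest scale; swap in something cheaper only if profiling shows it hurts.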
I am designing a database that will be based around a group progress quiz. The quiz consists of 55 questions, and ideally a group of 10 people will take the quiz every few weeks, each person taking it once for everyone in the group, including themselves. So, each time the group takes the quiz, 100 pieces of data will be added to the database.
Currently my table for storing the quiz answers will have the following columns:
quiz_taker_id // person taking the quiz
quiz_subject_id // taker is answering questions about this person
quiz_id // identifies if this is the 1st time taking the quiz, 2nd time, etc
question1 // answer to question 1
question2 // answer to question 2
... // etc, for all quiz questions
The quiz answers are incredibly simple; each is just a rating of 0-5 on a person's characteristics. Is this a good way to store this data? Are there better ways to do it? I am just starting to set up the website and DB, so I want to make sure I am approaching this the right way.
Whenever you want to process the data in any way (like producing post-quiz stats), it is a good idea to use a database. Your DB design is very simple and lacks flexibility: if, say, you want to add more questions later, you have to add an extra column.
So it really depends on what you plan to do with the collected data and whether you plan to extend your quiz rules.
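For example, a one-row-per-answer layout avoids schema changes when questions are added later (a sketch only; names are illustrative):

CREATE TABLE quiz_answers (
    quiz_id         INT NOT NULL,      -- which sitting of the quiz (1st, 2nd, ...)
    quiz_taker_id   INT NOT NULL,      -- person taking the quiz
    quiz_subject_id INT NOT NULL,      -- person being rated
    question_id     INT NOT NULL,      -- which of the 55 questions
    rating          TINYINT NOT NULL,  -- the 0-5 answer
    PRIMARY KEY (quiz_id, quiz_taker_id, quiz_subject_id, question_id)
);

Adding question 56 is then just a new row in a questions table, not an ALTER TABLE.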
This is too long for a comment.
The questions should be in a single table, questions, not one column per question. A basic questions table would have each question and its correct answer. This is probably good enough for your problem.
For surveys (and for quizzes, I imagine), there is a versioning problem, because questions slowly change over time. As a somewhat trivial example, you might start by asking "What is your gender?" and initially offer two answers, "Male" and "Female". Over time, you might add other answers: "Other", "Transsexual", "Hermaphrodite" and so on. When analyzing the answers, you might need to know the version of the question that was asked at a particular time.
That is a survey example, where there is no right answer, but a similar idea applies to quizzes: the questions and answers might evolve somewhat over time, yet you still want each one to be recognized as, say, Question 2, while knowing which version was asked.
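One hedged sketch of that versioning idea (names are illustrative):

-- Keep every revision of a question; answers can then reference
-- (question_id, version) so analysis knows exactly what was asked.
CREATE TABLE question_versions (
    question_id   INT NOT NULL,           -- stable identity, e.g. "Question 2"
    version       INT NOT NULL,           -- bumped when wording or choices change
    question_text VARCHAR(500) NOT NULL,
    PRIMARY KEY (question_id, version)
);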
Okay, so I have my user table ready with columns for all the technical information, such as username, profile picture, password and so on. Now I'm in a situation where I need to add superficial profile information, such as location, age, self-description, website, Facebook account, Twitter account, interests, etc. In total, I calculated this would amount to 12 new columns, and since my user table already has 18 columns, I'm at a crossroads. Other questions I read about this didn't really give a bottom-line answer about which method is most efficient.
I need to find out whether there is a more efficient way, and what the most efficient way to store this kind of information is. The base assumption is that my website will in the future have millions of users, so an option is needed that is able to scale.
I have so far come up with two different options:
Option 1: Store the superficial data in the users table, taking its total column count up to 30.
Or
Option 2: Store the superficial data in a separate table, linked to the users table.
Which of these has better ability to scale? Which is more efficient? Is there a third option that is better than these two?
A bonus question as well, if anyone has information about this: how do the biggest sites on the internet handle this? Thanks to anyone who participates with an answer; it is hugely appreciated.
My current database is MySQL with the mysql2 gem in Rails 4.
In your case, I would go with the second option. I suppose this would be more efficient because you would retrieve data from table 1 whenever the user logs in, and you would use data from table 2 (the superficial data) whenever you change his preferences. You would not have to retrieve all the data each time you want to do something. Bottom line, I would suggest modelling your data according to your usage scenarios (use cases), creating data entities (e.g. tables) matching your use-case entities. Then you should take into account the database normalization principles.
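A minimal sketch of that second option (column names are illustrative assumptions, and users.id is assumed to exist):

-- Login-critical data stays in users; the superficial profile
-- data lives in a 1:1 side table keyed by the same id.
CREATE TABLE user_profiles (
    user_id  INT PRIMARY KEY,
    location VARCHAR(100),
    website  VARCHAR(255),
    bio      TEXT,
    FOREIGN KEY (user_id) REFERENCES users(id)
);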
If you are interested in how these issues are handled by the biggest sites in the world, you should know that they do not use relational (SQL) databases. They actually use NoSQL databases, which run in a distributed fashion. This is a much more complicated scenario than yours. If you want to see related tools, you could start reading about Cassandra and Hadoop.
Hope I helped!
If you will need to access these 30 columns of information frequently, you could put all of them into the same table. That's what some widely used CMSes do, because even though a row is big, it's faster to retrieve one big row than many small rows from various tables (more SQL queries, more lookups, more indexes, ...).
Also a good read for your problem is Database normalization.
I'm designing a large-scale website.
To prevent duplicate voting, I'm going to store users' votes (userid, postid) in the database (I'm using MySQL now).
Which one is better?
1- Have just one row for each user and store all the postids he's voted on in a text field there.
2- Have one row for each vote, stored as integers.
Thanks
A few notes:
The text-based approach has the potential to overflow. You're gonna have to set a maximum size for that text field, and, even then, a high enough number of votes will exceed capacity and cause your code to fail.
The text-based approach has the potential to be very slow. Suppose I've voted for 100,000 posts. Checking if I've voted for Post X will involve downloading those 100,000 post IDs, parsing them into an array, and checking the array for Post X's ID. That's gonna be way slower than an indexed query of SELECT 1 FROM votes WHERE user_id = X and post_id = Y LIMIT 1;, which will always run at almost exactly the same speed: pretty darn fast if it's indexed. (If not, it'll essentially do the same thing as your text-based approach and be super-slow, so indexing will be very important here!) Plus, note that if you go with MySQL's LONGTEXT to avoid issue #1, you stand to transfer up to 4GB of data each time you want to check for a single vote. Eww.
In my experience, your row-per-vote approach will actually be simpler to implement (especially once you get comfortable with SQL), will scale better, and will have many fewer ways in which it could break. There are scales at which relational databases become infeasible, but, for almost all users, using SQL to its full potential is the best way to get great performance.
The correct way to do this is your second option. Relational databases are good at this stuff.
Make a table called uservotes or something, with the primary key of userid and postid. That automatically prevents duplicate votes being added. It also means you can do:
SELECT SUM(vote) FROM uservotes WHERE postid=42;
Not that you would do that... You'd probably just store the total vote on the post itself.
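A minimal MySQL sketch of that table (the vote column is an assumption, for up/down voting):

-- The composite primary key makes a second vote on the same post
-- a key violation, so the database itself rejects duplicates.
CREATE TABLE uservotes (
    userid INT NOT NULL,
    postid INT NOT NULL,
    vote   TINYINT NOT NULL,  -- e.g. +1 or -1
    PRIMARY KEY (userid, postid)
);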
I'm doing something different, but this is an easier-to-understand example. Think of the votes here. I add these votes to a separate table and log information about them, like by whom, when, and so on. Would you also add a field to the main table that simply counts the number of votes, or is this bad practice?
This is called "denormalization" and is considered bad practice unless you get a significant performance boost when you denormalize.
The biggest issue with this, however, is concurrency. What happens if two people vote on the poll and they both try to increment the VoteCount column?
Search denormalization on here and in Google, there has been plenty of discussion on this topic. Find what fits your exact situation best, although, from the looks of it, denormalization would be premature optimization in your situation.
Bad.
Incorrect.
Guaranteed problems and data inconsistencies. The vote count is "derived data" and should not be stored (a duplicate). For stable data (that which does not change), summaries are fair enough.
Now if the data (number of votes) is large, and you need to count it often (in queries), then enhance that alone: speed up counting the vote table from the main table, e.g. ensure there is an index on the column being looked up for the count.
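For example (illustrative names):

-- An index on the filtering column keeps the count cheap.
CREATE INDEX idx_votes_post ON votes (post_id);
SELECT COUNT(*) FROM votes WHERE post_id = 42;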
If the data is massive, e.g. a bank with millions of transactions per month, and you do not want to count them in order to produce the account balance on every query, enhance that alone. E.g. I calculate a month-to-date figure every night and store it at the account level; the day's figures need to be counted and added to the MTD figure in order to produce the true up-to-the-minute figure. At the end of the month, when all the auditing processes are changing various rows across the month, the MTD figure (to yesterday) can be recomputed on demand.
The short answer is YES. But you should keep in mind that duplication may become a headache, or even a nightmare, in your system's development and maintenance. If you want to store pre-calculated cache values to improve performance, the calculation of the cache should be encapsulated and transparent to other processes.
In this case:
Solution 1: When a user votes on the poll, the detailed information is recorded, and the vote count is increased by one automatically (i.e. the cache calculation is encapsulated in the data-writer process; see the sketch below).
Solution 2: When the vote information is recorded, nothing is done to the vote count; only a flag is changed to mark the vote count value as dirty. When the vote count is read, if its value is dirty, calculate it, update the value, and clear the flag; if its value is current (not dirty), read it directly (i.e. the cache calculation is encapsulated in the data-reader process).
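A sketch of Solution 1 in MySQL (table and column names are assumptions): doing both writes in one transaction, with an atomic increment, avoids the concurrent-voters problem raised earlier.

START TRANSACTION;
INSERT INTO votes (poll_id, user_id, voted_at) VALUES (42, 7, NOW());
-- Atomic read-modify-write: two concurrent voters cannot
-- overwrite each other's increment.
UPDATE polls SET vote_count = vote_count + 1 WHERE poll_id = 42;
COMMIT;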
Read Section 7 of The Pragmatic Programmer; you may get some ideas.
Actually, the Normal Forms used in database design are a special case of the DRY principle.
In short, NO. There is no point storing data that can be fetched with a COUNT query, and the second reason is that you have to manually maintain the counter value: more work, more potential for problems, and more code/algorithm to maintain. Really, do NOT do it; it is bad practice.
Greetings stackers,
I'm trying to come up with the best database schema for an application that lets users create surveys and present them to the public. There are a bunch of "standard" demographic fields that most surveys (but not all) will include, like First Name, Last Name, etc. And of course users can create an unlimited number of "custom" questions.
The first thing I thought of is something like this:
Survey
    ID
    SurveyName
SurveyQuestions
    SurveyID
    Question
Responses
    SurveyID
    SubmitTime
ResponseAnswers
    SurveyID
    Question
    Answer
But that's going to suck every time I want to query data out. And it seems dangerously close to the Inner Platform Effect.
An improvement would be to include as many fields as I can think of in advance in the responses table:
Responses
    SurveyID
    SubmitTime
    FirstName
    LastName
    Birthdate
    [...]
Then at least queries for data from these common columns are straightforward, and I can query, say, the average age of everyone who ever answered any survey where they gave their birthdate.
But it seems like this will complicate the code a bit. Now to see which questions are asked in a survey I have to check which common response fields are enabled (using, I guess, a bitfield in Survey) AND what's in the SurveyQuestions table. And I have to worry about special cases, like if someone tries to create a "custom" question that duplicates a "common" question in the Responses table.
Is this the best I can do? Am I missing something?
Your first schema is the better choice of the two. At this point, you shouldn't worry about performance problems. Worry about making a good, flexible, extensible design. There are all sorts of tricks you can do later to cache data and make queries faster. Using a less flexible database schema in order to solve a performance problem that may not even materialize is a bad decision.
Besides, many (perhaps most) survey results are only viewed periodically and by a small number of people (event organizers, administrators, etc.), so you won't constantly be querying the database for all of the results. And even if you were, the performance will be fine. You would probably paginate the results somehow anyway.
The first schema is much more flexible. You can, by default, include questions like name and address, but for anonymous surveys, you could simply not create them. If the survey creator wants to only view everyone's answers to three questions out of five hundred, that's a really simple SQL query. You could set up a cascading delete to automatically delete responses and questions when a survey is deleted. Generating statistics will be much easier with this schema too.
Here is a slightly modified version of the schema you provided. I assume you can figure out what data types go where :-)
surveys
    survey_id (index)
    title
questions
    question_id (index, auto increment)
    survey_id (link to surveys->survey_id)
    question
responses
    response_id (index, auto increment)
    survey_id (link to surveys->survey_id)
    submit_time
answers
    answer_id (index, auto increment)
    question_id (link to questions->question_id)
    answer
I would suggest you always take a normalized approach to your database schema and then later decide if you need to create a solution for performance reasons. Premature optimization can be dangerous. Premature database de-normalization can be disastrous!
I would suggest that you stick with the original schema and later, if necessary, create a reporting table that is a de-normalized version of your normalized schema.
One change that may or may not simplify things would be to not link the ResponseAnswers back to the SurveyID. Rather, create an ID per response and per question, and let your ResponseAnswers table contain the fields ResponseID, QuestionID, Answer. Although this would require keeping unique identifiers for each unit, it would help keep things a little more normalized. The response answers do not need to be associated with the survey directly, just with the specific question they answer and the response they belong to.
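In other words (a sketch, using the column names suggested above):

CREATE TABLE ResponseAnswers (
    ResponseID INT NOT NULL,  -- which submitted response this row belongs to
    QuestionID INT NOT NULL,  -- which question it answers
    Answer     VARCHAR(500),
    PRIMARY KEY (ResponseID, QuestionID)
);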
I created a customer surveys system at my previous job and came up with a schema very similar to what you have. It was used to send out surveys (on paper) and tabulate the responses.
A couple of minor differences:
Surveys were NOT anonymous, and this was made very clear in the printed forms. It also meant that the demographic data in your example was known in advance.
There was a pool of questions which were attached to the surveys, so one question could be used on multiple surveys and analyzed independently of the survey it appeared on.
Handling different types of questions got interesting -- we had a 1-3 scale (e.g., Worse/Same/Better), 1-5 scale (Very Bad, Bad, OK, Good, Very Good), Yes/No, and Comments.
There was special code to handle the comments, but the other question types were handled generically by having a table of question types and another table of valid answers for each type.
To make querying easier you could probably create a function to return the response based on a survey ID and question ID.
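With the schema above, that lookup might be a simple join (the literal IDs are placeholders):

-- All answers given to question 5 of survey 1.
SELECT q.question, a.answer
FROM questions q
JOIN answers a ON a.question_id = q.question_id
WHERE q.survey_id = 1
  AND q.question_id = 5;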