Storing duplicate fields: good or bad [closed] - mysql

Let's say a user has a posts table like this:
id | user_id | post_param_a | comment
1  | 69      | foo          | This is a post
2  | 69      | foo          | An edit was made
3  | 69      | foo          | The latest, current version
Post with id=1 is the first post the user made. Post with id=2 is an edit that was made to the post, and id=3 is the latest, current version of the post.
post_param_a cannot change across versions, and neither can user_id – they always stay the same as in the first version. So we could instead store it like this:
id | user_id | post_param_a | comment
1  | 69      | foo          | This is a post
2  | NULL    | NULL         | An edit was made
3  | NULL    | NULL         | The latest, current version
So the question is: would it be better to store it the second way, with no duplication? That way, to get the current version of a user's post we'd have to join the first version and check its user_id every time. Or is it okay to store duplicate fields in this case?
P.S. We're asking because we want to avoid duplication and accidental changes to values that cannot change across versions, so we want to store them all in one place.

Take the entity Post and look at the simple tuple:
ID  User_ID  Post_Param_A  Comment
1   69       foo           This is a post
This is perfectly normalized. However, the post may undergo editing, and you want to track the changes made, so you add another field to version the rows. Instead of an incremental value, however, it makes more sense to add a datetime field.
ID  EffDate       User_ID  Post_Param_A  Comment
1   1/1/16 12:00  69       foo           This is a post
This has two advantages: 1) if you track changes, you will want to know when each version was saved anyway, and 2) you don't have to look up the largest incremental value for the post to know what value to save with each new version – just save the current date and time.
However, with either an incremental value or a date, there is a problem. In the simple row, each field has a functional dependency on the PK. In the versioned row, the PK becomes (ID, EffDate): Comment depends on the full key, but User_ID and Post_Param_A depend on ID alone.
The tuple is no longer in 2NF.
So the solution is a simple matter of normalizing it:
ID  User_ID  Post_Param_A
1   69       foo

ID  EffDate       Comment
1   1/1/16 12:00  This is a post
1   1/1/17 12:00  An edit was made
1   1/1/17 15:00  The last and current version (so far)
with (ID, EffDate) the composite PK in the new table.
The query to read the latest post is a bit complicated:
select p.ID, v.EffDate, p.User_ID, p.Post_Param_A, v.Comment
from Posts p
join PostVersions v
  on v.ID = p.ID
 and v.EffDate = (
     select Max( v1.EffDate )
     from PostVersions v1
     where v1.ID = p.ID
       and v1.EffDate <= NOW() )
 and p.ID = 1;
This is not really as complicated as it looks, and it is impressively fast. The really neat feature is this: if you replace NOW() with, say, '2017-01-01 13:00', the result will be the second version. So you can query the present or the past using the same query.
Another neat feature is achieved by creating a view from the NOW() query with the last line ("and p.ID = 1") removed. This view will expose the latest version of all posts. Create triggers on the view, and apps that are only interested in the current version can do their work without any consideration of the underlying structure.
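For illustration, a minimal sketch of such a view. (Note: MySQL itself does not allow triggers on views, so the trigger advice applies to engines such as SQL Server or Oracle.)
create view CurrentPosts as
select p.ID, v.EffDate, p.User_ID, p.Post_Param_A, v.Comment
from Posts p
join PostVersions v
  on v.ID = p.ID
 and v.EffDate = (
     select Max( v1.EffDate )
     from PostVersions v1
     where v1.ID = p.ID
       and v1.EffDate <= NOW() );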

You could have a separate table that stores the post_param_a for each post_id; then you wouldn't need NULL values or duplicated values.

The 1st solution is better, because user_id is stored alongside post_id and avoids ambiguous interpretations.
This way, to get a current version of user's post we'd have to join the first version and check its user_id all the time.
Have you thought about adding a timestamp field, so that you can always get the latest version of a post?
In the 2nd solution, NULL becomes ambiguous as the data grows, and even querying becomes difficult: every SQL statement has to be carefully designed around the NULL cases and their specific meanings.
The 3rd solution could be a normalization of your table into two separate ones, e.g. post and post_history. As you mentioned in the question, post_param_a cannot change across versions, and neither can user_id – they always stay the same as in the first version. In this case:
In table post, you store the information about the post that is permanent (won't be changed): id, param_a, user_id, created_at ...
In table post_history, you store the information that belongs to each version/modification: version_id, comment, modified_at ... And you can add an FK constraint so that post_history.post_id references post.id.
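A minimal sketch of that layout (column types are assumptions):
CREATE TABLE post (
  id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  user_id    INT UNSIGNED NOT NULL,
  param_a    VARCHAR(255) NOT NULL,   -- immutable across versions
  created_at DATETIME NOT NULL
);

CREATE TABLE post_history (
  post_id     INT UNSIGNED NOT NULL,
  version_id  INT UNSIGNED NOT NULL,
  comment     TEXT,
  modified_at DATETIME NOT NULL,
  PRIMARY KEY (post_id, version_id),
  FOREIGN KEY (post_id) REFERENCES post(id)
);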

Related

MySQL subquery with if conditionals

I'm querying a table from a third party plugin so I don't have control over how the data is being inputted. It's a course plugin broken into 4 different quizzes. One of the questions is being used in all four quizzes.
There is no "quiz_id", which is why I think the only way to query the data is with some sort of conditional. There IS a date field and a unique id field.
This is what my subquery looks like:
(SELECT y.post_content
 FROM wp_posts AS y
 WHERE 137 = y.post_author
   AND y.post_title LIKE '%If you checked "Other" in "What are your organization’s primary goals related to hiring people with autism?", please explain%'
 ORDER BY y.post_author ASC
 LIMIT 1)
This works to query the answer (y.post_content) for the first quiz, but not for all 4 quizzes. Is there a conditional I can use for this? I.e., let's say I'm querying results for the 2nd quiz: if there are four answers, pick the 2nd one; if there are 3 answers, pick the 2nd one; if there are 2 answers, pick the most recent one.

Liked Posts Design Specifics

So I've found through my own research that the best way to design a structure for liking posts is with a table like the following. Let's say, like Reddit, a post can be upvoted, downvoted, or not voted on at all.
The table would then have three columns: [username, post, liked].
Liked could be some kind of boolean: 1 indicating liked, and 0 indicating disliked.
Then to find a post like amount, I would do SELECT COUNT(*) FROM likes WHERE post=12341 AND liked=1 for example, then do the same for liked=0(disliked), and do the addition server side along with controversy percentage.
So I have a few concerns. First off, what would be the appropriate way to find out if a user liked a post? Would I try to select the liked boolean value, and either retrieve it or catch an error? Or would I first check whether the record exists, and then do another select to find out the value? What if I want to check if a user liked multiple posts at once?
Secondly, would this table not need a primary key? Because no row will have the same post and username, should I use a compound primary key?
For performance you will want to alter your database plans:
User Likes Post table
Fields:
Liked should be a boolean, you are right. You can transform this to -1/+1 in your code. You will cache the numeric totals elsewhere.
Username should be UserID. You want only numeric values in this table for speed.
Post should be PostID for the same reason.
You also want a numeric primary key because they're easier to search against, and to perform sub-selects with.
And create a unique index on (UserID, PostID), because this table is mainly an index built for speed.
So did a user vote on a post?
select id
from user_likes_post
where userID = 123 and postID = 456;
Did the user like the post?
select id
from user_likes_post
where userID = 123 and postID = 456 and liked = true;
You don't need to worry about errors; you'll either get results or you won't, so you might as well go straight to the value you're after:
select liked from user_likes_post where userID = 123 and postID = 456;
Get all the posts they liked:
select postID
from user_likes_post
where userID = 123 and liked = true;
Post Score table
PostID
TotalLikes
TotalDislikes
Score
This second table will be dumped and refreshed every n minutes by calculating on the first table. This second table is your cached aggregate score that you'll actually load for all users visiting that post. Adjust the frequency of this repeat dump-and-repopulate schedule however you see fit. For a small hobby or student project, just do it every 30 seconds or 2 minutes; bigger sites, every 10 or 15 minutes. For an even bigger site like reddit, you'd want to make the schema more complex to allow busier parts of the site to have faster refresh.
// this is not exact code, just an outline
totalLikes    = select count(*) from user_likes_post
                where postID = 123 and liked = true;
totalDislikes = select count(*) from user_likes_post
                where postID = 123 and liked = false;
totalVotes = totalLikes + totalDislikes;
score      = totalLikes / totalVotes;
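In MySQL the whole dump-and-repopulate step can be one statement per post; a sketch, assuming post_score is keyed by postID:
-- recompute the cached totals for one post
REPLACE INTO post_score (postID, totalLikes, totalDislikes, score)
SELECT postID,
       SUM(liked = true)  AS totalLikes,
       SUM(liked = false) AS totalDislikes,
       SUM(liked = true) / COUNT(*) AS score
FROM user_likes_post
WHERE postID = 123
GROUP BY postID;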
(You can simulate an update by involving the user's localStorage -- client-side JavaScript showing a bump up or down on the posts that user has voted on.)
Given your suggested 3-column table and the selects you suggest, be sure to have
PRIMARY KEY(username, post) -- helps with "did user like a post"
INDEX(post, liked) -- for that COUNT
When checking whether a user liked a post, either do a LEFT JOIN so that you get one of three things: 1=liked, 0=disliked, or NULL=not voted. Or you could use EXISTS( SELECT .. ).
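A sketch of that LEFT JOIN, assuming a posts table with an id column (not shown in the question):
SELECT p.id,
       l.liked  -- 1 = liked, 0 = disliked, NULL = not voted
FROM posts p
LEFT JOIN likes l ON l.post = p.id AND l.username = 'some_user'
WHERE p.id = 12341;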
Tables need PKs.
I agree with Rick James that the likes table should be uniquely indexed by the (username, post) pair.
I also advise you to allow a bit of redundancy and keep a like_counter in the posts table. It will let you significantly reduce the load of the common queries.
Increase or decrease the counter right after successfully adding the like/dislike record.
All in all:
to get posts with likes: a plain select of posts; no need to add joins and aggregate sub-queries.
to like/dislike: (1) insert into likes; on success, (2) update posts.like_counter. The unique index prevents duplication.
to find out whether the user has already liked the post: select from likes by the username+post pair; the index makes this fast.
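A sketch of the like path under those assumptions (posts.like_counter is the redundant counter described above, and the posts table is assumed to have an id column):
-- record a like and bump the cached counter together
START TRANSACTION;
INSERT INTO likes (username, post, liked) VALUES ('some_user', 12341, 1);
UPDATE posts SET like_counter = like_counter + 1 WHERE id = 12341;
COMMIT;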
My initial thought was that the problem is that a boolean type is not rich enough to express the possible reactions to a post. So instead of a boolean, you would need an enum with the possible states Liked, Disliked, and a third, default state of Un-reacted.
Now, however, it seems the boolean is enough after all, because you do not need to record the Un-reacted state: a lack of reaction simply means there is no row in the table.
What would be the appropriate way to find out if a user liked a post?
SELECT Liked
FROM Likes
WHERE Likes.PostId = 1234
  AND Likes.UserName = 'UniqueUserName';
If the post was not interacted with by the user, there would be no results. Otherwise, 1 if liked and 0 if disliked.
What if I want to check if a user liked multiple posts at once?
I think for that you need to store a timestamp too. You can then use the timestamp to see whether there are multiple liked posts within a short duration.
You could employ k-means clustering to figure if there are any "cluster" of likes. The complete explanation is too big to add here.
Would this table not need a primary key?
Of course it would. But a Like is a weak entity depending on the Post, so it would include the PK of Post, which is the field post (I assume). Combined with username, we have the PK, because (post, username) is unique for a user's reaction.
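In MySQL terms, using the column names from the query above, that would be something like:
-- composite primary key for the weak entity
ALTER TABLE Likes ADD PRIMARY KEY (PostId, UserName);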

How should I store another table's row in order to have statistics data in the frontend?

I have a table full of businesses each with a scannable QR Code, and another table that stores the scans the users make. Right now, the scan table schema looks like this:
id | user_id | business_id | scanned_date
If I want to create charts and analytics in the front-end of my application for statistics about business scans, I'd just take the business_id and fetch the business info with it. But the problem is that if a business's data is ever changed, the statistical data will also change, and it shouldn't be this way.
The first thing that came to my mind for keeping the data static was to store the whole business row as a JSON string in a new column of the scan table, but that doesn't sound like good practice. (Although storing a JSON string isn't advised against when the data won't be tampered with, which it won't be here, since it's supposed to be static.)
Another thing I thought of was to make a clone of the business table's schema, but that would mean double work whenever I change the original, since I'd also have to change the clone.
You need a way to represent the history of the businesses' data in your database.
You didn't mention what attributes you store in each business's row, so I will guess. Let's say you have these columns
business_id
name
category
qr_code
website
Your problem is this: if you change any attribute of the business, the old value vanishes.
Here's a solution to that problem. Add start and end columns to the table. They should probably have TIMESTAMP data types.
Then, never DELETE rows from the table, and when you UPDATE them, only change the value of the end column; add new rows instead. Let me explain.
For a row to be active at the time NOW(), it must pass these WHERE criteria:
start <= NOW()
AND (end IS NULL OR end > NOW())
Let's say you start with two businesses in the table.
business_id  start       end   name          category  qr_code  website
1            2019-05-01  NULL  Joe's tavern            lkjhg12  joes.example.com
2            2019-05-01  NULL  Acme rockets            sdlfj48  acme.example.com
Good: You can count QR code scans day by day with this query
SELECT COUNT(*), DATE(s.scanned_date) day, b.name
FROM business b
JOIN scan s ON b.business_id = s.business_id
  AND b.start <= s.scanned_date
  AND (b.end IS NULL OR b.end > s.scanned_date)
GROUP BY DATE(s.scanned_date), b.name
Now, suppose Joe sells his tavern and its name changes. To represent that change you must UPDATE the existing row for Joe's to set the end date, and then INSERT a new row with the new data. Afterward, your table looks like this
             business_id  start       end         name          category  qr_code  website
(updated)    1            2019-05-01  2019-05-24  Joe's tavern            lkjhg12  joes.example.com
(inserted)   1            2019-05-24  NULL        Fancy tavern            lkjhg12  fancy.example.com
(unchanged)  2            2019-05-01  NULL        Acme rockets            sdlfj48  acme.example.com
The query above still works, because it takes into account the start and end dates of the changes.
This approach works best when you have many more scans than changes to businesses. That seems likely in this case.
Your business table needs a composite primary key (business_id, start).
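A sketch of that table (column types are assumptions; start and end are quoted because they collide with SQL keywords):
CREATE TABLE business (
  business_id INT NOT NULL,
  `start`     TIMESTAMP NOT NULL,
  `end`       TIMESTAMP NULL,  -- NULL = row is still current
  name        VARCHAR(100),
  category    VARCHAR(50),
  qr_code     VARCHAR(20),
  website     VARCHAR(100),
  PRIMARY KEY (business_id, `start`)
);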
Prof. Richard Snodgrass wrote a book on this subject, Developing Time-Oriented Database Applications in SQL, and generously made a pdf available.
I hope I got your question.
You could try having duplicates in the business table. When you edit a business, instead of updating the existing row, INSERT a new one with a new id. The stats will use the old id and will not be affected by the changes. To get the latest business info, sort by id and take the last one. That way you won't need a second table for business data.
Edit: If the business id needs to be specific to a business, then instead of relying on the business id, you can add a column that records when each row was inserted. Again, sort by that column and limit the query to get the latest one.
Edit 2:
Removing entities that were inserted a certain amount of time ago:
If you don't need stats from more than, say, a month ago, you can remove those rows from businesses to save space. Use the new time column to compute each row's age and check whether it is greater than the range you want to keep.
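For example, a sketch using a hypothetical inserted_at column:
-- purge versions older than one month
DELETE FROM business
WHERE inserted_at < NOW() - INTERVAL 1 MONTH;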

The optimal way to store multiple-selection survey answers in a database

I'm currently working on a survey creation/administration web application with PHP/MySQL. I have gone through several revisions of the database tables, and I once again find that I may need to rethink the storage of a certain type of answer.
Right now, I have a table that looks like this:
survey_answers
id PK
eid
sesid
intvalue Nullable
charvalue Nullable
id = unique value assigned to each row
eid = Survey question that this answer is in reply to
sesid = The survey 'session' (information about the time and date of a survey take) id
intvalue = The value of the answer if it is a numerical value
charvalue = the value of the answer if it is a textual representation
This allowed me to continue using MySQL's mathematical functions to speed up processing.
I have however found a new challenge: storing questions that have multiple responses.
An example would be:
Which of the following do you enjoy eating? (choose all that apply)
Girl Scout Cookies
Bacon
Corn
Whale Fat
Now, when I want to store the result, I'm not sure of the best way to handle it.
Currently, I have a table just for multiple choice options that looks like this:
survey_element_options
id PK
eid
value
id = unique value associated with each row
eid = question/element that this option is associated with
value = textual value of that option
With this setup, I then store returned multiple-selection answers in survey_answers as strings of comma-separated ids of the element_options rows that were selected in the survey (i.e. something like "4,6,7,9"). I'm wondering if that is indeed the best solution, or if it would be more practical to create a new table that holds each chosen answer, referencing back to a given answer row, which in turn references the element and ultimately the survey.
EDIT
for anyone interested, here is the approach I ended up taking (In PhpMyAdmin Relations View):
And a rudimentary query to gather the counts for a multiple select question would look like this:
SELECT e.question AS question, eo.value AS value, COUNT(eo.value) AS count
FROM survey_elements e, survey_element_options eo, survey_answer_options ao
WHERE e.id = 19
AND eo.eid = e.id
AND ao.oid = eo.id
GROUP BY eo.value
This really depends on a lot of things.
Generally, storing lists of comma-separated values in a database is bad, especially if you plan to do anything remotely intelligent with that data, such as any kind of advanced reporting on the answers.
The best relational way to store this is to also define the answers in a second table, and then link them to the user's response to a question in a third table (with multiple entries per user-question, or possibly user-survey-question if a user could take multiple surveys containing the same question).
This can get slightly complex; as a simple example scenario:
Example tables:
Users (Username, UserID)
Questions (qID, QuestionsText)
Answers (AnswerText [in this case example could be reusable, but this does cause an extra layer of complexity as well], aID)
Question_Answers ([Available answers for this question, multiple entries per question] qaID, qID, aID),
UserQuestionAnswers (qaID, uID)
Note: Meant as an example, not a recommendation
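A sketch of those example tables as MySQL DDL (types and key names are assumptions, and, per the note above, this is an example rather than a recommendation):
CREATE TABLE users     (userID INT PRIMARY KEY, username VARCHAR(50));
CREATE TABLE questions (qID INT PRIMARY KEY, questionText TEXT);
CREATE TABLE answers   (aID INT PRIMARY KEY, answerText TEXT);

-- the available answers for each question, multiple entries per question
CREATE TABLE question_answers (
  qaID INT PRIMARY KEY,
  qID  INT NOT NULL,
  aID  INT NOT NULL,
  FOREIGN KEY (qID) REFERENCES questions(qID),
  FOREIGN KEY (aID) REFERENCES answers(aID)
);

-- one row per option a user actually selected
CREATE TABLE user_question_answers (
  qaID INT NOT NULL,
  uID  INT NOT NULL,
  PRIMARY KEY (qaID, uID),
  FOREIGN KEY (qaID) REFERENCES question_answers(qaID),
  FOREIGN KEY (uID)  REFERENCES users(userID)
);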
Convert the primary key to a non-unique index and add answers to the same question under the same id.
For example.
id | eid | sesid | intval | charval
3  | 45  | 30    | 2      |
3  | 45  | 30    | 4      |
You can still add another column for regular unique PK if needed.
Keep things simple. No need for a relation here.
It's a horses for courses thing really.
You can store them as a comma-separated string (but then what happens when you have a literal comma in one of your answers?).
You can store as a one-to-many table, such as:
survey_element_answers
id PK
survey_answers_id FK
intvalue Nullable
charvalue Nullable
And then loop over that table. If you picked one answer, it would create one row in this table. If you pick two answers, it will create two rows in this table, etc. Then you would remove the intvalue and charvalue from the survey_answers table.
Another choice, since you're already storing the element options in their own table, is to create a many-to-many table, such as:
survey_element_answers
id PK
survey_answers_id FK
survey_element_options_id FK
Again, one row per option selected.
Another option yet again is to store a bitmask value. This will remove the need for a many-to-many table.
survey_element_options
id PK
eid FK
value Text
optionnumber unique for each eid
optionbitmask 1 << optionnumber (i.e. 2 to the power of optionnumber; note that ^ means XOR in MySQL)
optionnumber should be unique for each eid, incrementing from zero. This imposes a limit of 63 options if you are using bigint, or 31 options if you are using int.
And then in your survey_answers
id PK
eid
sesid
answerbitmask bigint
Answerbitmask is calculated by adding together the optionbitmask values of all the options the user selected. For example, if 7 were stored in Answerbitmask, the user selected the first three options (1 + 2 + 4).
Joins can be done by:
WHERE survey_answers.answerbitmask & survey_element_options.optionbitmask > 0
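For instance, a sketch that expands a stored bitmask back into the selected option texts (the answer id is made up):
SELECT a.id, eo.value
FROM survey_answers a
JOIN survey_element_options eo
  ON eo.eid = a.eid
 AND (a.answerbitmask & eo.optionbitmask) > 0
WHERE a.id = 42;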
So yeah, there's a few options to consider.
If you don't use the id as a foreign key in another query, or if you can query results using the sesid, try a many-to-one relationship.
Otherwise, I'd store multiple-choice answers as a serialized array, such as JSON or the output of PHP's serialize() function.

MySQL select users on multiple criteria

My team is working on a PHP/MySQL website for a school project. I have a table of users with typical information (ID, first name, last name, etc.). I also have a table of questions with sample data like below. For this simplified example, all the answers to the questions are numerical.
Table Questions:
qid | questionText
1 | 'favorite number'
2 | 'gpa'
3 | 'number of years doing ...'
etc.
Users will have the ability to fill out a form to answer any or all of these questions. Note: users are not required to answer all of the questions, and the questions themselves are subject to change in the future.
The answer table looks like this:
Table Answers:
uid | qid | value
37 | 1 | 42
37 | 2 | 3.5
38 | 2 | 3.6
etc.
Now, I am working on the search page for the site. I would like the user to select what criteria they want to search on. I have something working, but I'm not sure it is efficient at all or if it will scale (not that these tables will ever be huge - like I said, it is a school project). For example, I might want to list all users whose favorite number is between 100 and 200 and whose GPA is above 2.0. Currently, I have a query builder that works (it creates a valid query that returns accurate results - as far as I can tell). A result of the query builder for this example would look like this:
SELECT u.ID, u.name (etc)
FROM User u
JOIN Answer a1 ON u.ID=a1.uid
JOIN Answer a2 ON u.ID=a2.uid
WHERE 1
AND (a1.qid=1 AND a1.value>100 AND a1.value<200)
AND (a2.qid=2 AND a2.value>2.0)
I add the WHERE 1 so that in the for loops I can just append " AND (...)". I realize I could drop the '1' and just use implode(' AND ', $array), adding the WHERE only if the array is not empty, but I figured this is equivalent. If not, I can change that easily enough.
As you can see, I add a JOIN for every criteria the searcher asks for. This also allows me to order by a1.value ASC, or a2.value, etc.
First question:
Is this table organization at least somewhat decent? We figured that since the number of questions is variable, and not every user answers every question, that something like this would be necessary.
Main question:
Is the query way too inefficient? I imagine that it is not ideal to join the same table to itself up to maybe a dozen or two times (if we end up putting that many questions in). I did some searching and found these two posts which seem to kind of touch on what I'm looking for:
Mutiple criteria in 1 query
This uses multiple nested (correct term?) queries in EXISTS
Search for products with multiple criteria
One of the comments by youssef azari mentions using 'query 1' UNION 'query 2'
Would either of these perform better/make more sense for what I'm trying to do?
Bonus question:
I left this out above for simplicity's sake, but I actually have 3 tables (for number-valued questions, booleans, and text).
The decision to have separate tables was because (as far as I could tell) it was either that or one big answers table with 3 value columns of different types, two of which would always be empty.
This works with my current query builder - an example query would be
SELECT u.ID,...
FROM User u
JOIN AnswerBool b1 ON u.ID=b1.uid
JOIN AnswerNum n1 ON u.ID=n1.uid
JOIN AnswerText t1 ON u.ID=t1.uid
WHERE 1
AND (b1.qid=1 AND b1.value=true)
AND (n1.qid=16 AND n1.value<999)
AND (t1.qid=23 AND t1.value LIKE '...')
With that in mind, what is the best way to get my results?
One final piece of context:
I mentioned this is for a school project. While this is true, the eventual goal (it is an undergrad senior design project) is to have a department use our site for students creating teams for their senior design. For a rough estimate of size: every semester, the department would have somewhere around 200 or so students use our site to form teams. Obviously, when we're done, the department will (hopefully) check our site for security issues and other things they need to worry about (what with FERPA and all). We are trying to take into account all common security practices and scalability concerns, but in the end, our code may be improved by others.
UPDATE
As per nnichols' suggestion, I put in a decent amount of data and ran some tests on different queries. I put around 250 users in the table, and about 2000 answers in each of the 3 tables. I found the links provided very informative
(links removed because I can't hyperlink more than twice yet – they are in nnichols' response)
as well as this one that I found:
http://phpmaster.com/using-explain-to-write-better-mysql-queries/
I tried 3 different types of queries, and in the end, the one I proposed worked the best.
First: using EXISTS
SELECT u.ID,...
FROM User u WHERE 1
AND EXISTS
(SELECT * FROM AnswerNumber
WHERE uid=u.ID AND qid=# AND value>#) -- or any condition on value
AND EXISTS
(SELECT * FROM AnswerNumber
WHERE uid=u.ID AND qid=another # AND some_condition(value))
AND EXISTS
(SELECT * FROM AnswerText
...
I used 10 conditions on each of the 3 answer tables (resulting in 30 EXISTS)
Second: using IN - a very similar approach (maybe even identical?) which yields the same results
SELECT u.ID,...
FROM User u WHERE 1
AND (u.ID) IN (SELECT uid FROM AnswerNumber WHERE qid=# AND ...)
...
again with 30 subqueries.
The third one I tried was the same as described above (using 30 JOINs)
The results of using EXPLAIN on the first two were as follows: (identical)
The primary query on table u had a type of ALL (bad, though the users table is not huge), and the number of rows searched was roughly twice the size of the user table (not sure why). Every other row in the EXPLAIN output was a dependent query on the relevant answer table, with a type of eq_ref (good), using WHERE and key=PRIMARY KEY and searching only 1 row. Overall not bad.
For the query I suggested (JOINing):
The primary query was actually on whichever table you joined first (in my case AnswerBool), with a type of ref (better than ALL). The number of rows searched was equal to the number of distinct questions answered by anyone (as in: 50 distinct questions have been answered by at least one user), which will be much less than the number of users. Each additional row in the EXPLAIN output was a SIMPLE query with type eq_ref (good), using WHERE and key=PRIMARY KEY and searching only 1 row. Overall almost the same, but with a smaller starting multiplier.
One final advantage of the JOIN method: it was the only one where I could figure out how to order by various values (such as n1.value). Since the other two queries use subqueries, I could not access the value of a specific subquery. Adding the ORDER BY clause did change the Extra field of the first query row to include 'using temporary' (required, I believe, for ORDER BY) and 'using filesort' (not sure how to avoid that). However, even with those slow-downs, the number of rows is still much less, and the other two (as far as I could tell) cannot use ORDER BY.
You could answer most of these questions yourself with a suitably large test dataset and the use of EXPLAIN and/or the profiler.
Your INNER JOINs will almost certainly perform better than switching to EXISTS, but again, this is easy to test with a suitable test dataset and EXPLAIN.
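For instance, prefixing any candidate query with EXPLAIN shows its access plan; a sketch using the question's tables:
EXPLAIN SELECT u.ID
FROM User u
JOIN Answer a1 ON u.ID = a1.uid
WHERE a1.qid = 1 AND a1.value > 100 AND a1.value < 200;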