Is having duplicate Database values better than querying more times? - mysql

Consider a table called users and a table called votes.
A user has an id and a country column.
Every vote belongs to a user, but the purpose when retrieving the vote is to find out which country it came from. Therefore you would need to query once to get the vote, and then query the users table to get the country.
Considering a large, frequently queried database, is it better to just add a country column to the votes table and have it duplicate the one in users, or to just use the method above?

Yes. No. Maybe.
The answer to your question depends on several things that you don't mention in the question. The first thing to note is that the query in VKP's answer is quite sufficient under most circumstances.
Second, if country is a full country name, then storing the full country name (which can be rather long) may greatly expand the size of the table. This increase in size may actually slow down certain queries, compared with doing the join. Of course, this would be much less significant for 2- or 3-character codes, or if the width of the records in votes is already several hundred bytes.
But perhaps the most important consideration is whether you want the vote counted under the user's current country, or under the country assigned to the user when the vote was made. The first option says to always use a join to get the current value. The second is a very strong argument for including country in the votes table.
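To make the "current country" versus "country at vote time" distinction concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the column name country_at_vote is an assumption of mine, not something from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, country TEXT NOT NULL);
    -- country_at_vote is a hypothetical snapshot column (not in the original schema)
    CREATE TABLE votes (
        vote_id INTEGER PRIMARY KEY,
        userid INTEGER NOT NULL REFERENCES users(id),
        country_at_vote TEXT NOT NULL
    );
    INSERT INTO users VALUES (1, 'CA');
    INSERT INTO votes VALUES (10, 1, 'CA');
    -- the user moves after the vote has been cast
    UPDATE users SET country = 'FR' WHERE id = 1;
""")

# The join reports the user's current country; the snapshot column keeps the old one.
current_country, snapshot_country = conn.execute("""
    SELECT u.country, v.country_at_vote
    FROM votes v JOIN users u ON u.id = v.userid
    WHERE v.vote_id = 10
""").fetchone()
print(current_country, snapshot_country)  # FR CA
```

If both values should always agree, the extra column is pure duplication; if they are allowed to differ, it is no longer a duplicate at all but a separate fact about the vote.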

select v.vote_id, u.country
from users u join votes v
on u.id = v.userid
If you need to see the country from which a vote was cast, you can join the tables and get it. Also, adding a country column to the votes table is not recommended, as it would only duplicate data that already lives in users.

The way you have explained it, country is an attribute of user: user "lives in" or "is citizen of" a country. Vote is an action that may be taken by users: users cast votes.
How is it that you have a vote under consideration without already knowing the user? How was this vote selected in the first place? There must be some other detail(s) that you have omitted.
If you are searching for aggregate values ("How many votes were cast by Canadians during July?") then you have to join the tables anyway -- filtering on users only in Canada and votes only during July. A query for "In which countries did any citizens cast at least one vote in July?" would be trickier to code, but still requires a join.
The join needed by the latter question could be eliminated by duplicating the country to the Votes table. But I don't think any performance improvement would be significant and you must remember that you will have made your database a little more complicated, a little less maintainable and a little less robust. It would have to be quite a large performance boost to make all that worthwhile.
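The two aggregate questions mentioned in this answer could be sketched roughly as follows, again with Python's sqlite3 standing in for MySQL. The cast_on date column and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, country TEXT NOT NULL);
    -- cast_on is a hypothetical date column (ISO strings compare correctly as text)
    CREATE TABLE votes (
        vote_id INTEGER PRIMARY KEY,
        userid INTEGER NOT NULL REFERENCES users(id),
        cast_on TEXT NOT NULL
    );
    INSERT INTO users VALUES (1, 'Canada'), (2, 'Canada'), (3, 'France');
    INSERT INTO votes VALUES
        (1, 1, '2015-07-03'), (2, 2, '2015-07-20'),
        (3, 3, '2015-07-04'), (4, 1, '2015-08-01');
""")

# "How many votes were cast by Canadians during July?"
canadian_july_votes = conn.execute("""
    SELECT COUNT(*)
    FROM votes v JOIN users u ON u.id = v.userid
    WHERE u.country = 'Canada'
      AND v.cast_on >= '2015-07-01' AND v.cast_on < '2015-08-01'
""").fetchone()[0]

# "In which countries did any citizens cast at least one vote in July?"
countries_voting_in_july = [row[0] for row in conn.execute("""
    SELECT DISTINCT u.country
    FROM users u JOIN votes v ON v.userid = u.id
    WHERE v.cast_on >= '2015-07-01' AND v.cast_on < '2015-08-01'
    ORDER BY u.country
""")]

print(canadian_july_votes, countries_voting_in_july)  # 2 ['Canada', 'France']
```

Both queries need the join as long as country lives only in users, which is the point being made above.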

Related

Is it efficient to merge similar data objects into a single table?

I need to store a lot of similar data about my question-and-answer system, such as votes, follows, bookmarks, etc.
Taking voting as an example, what is the best table layout for storing votes for questions, answers, and posts?
1. Store the votes separately, which gives 3 tables: UserQuestionVotes, UserAnswerVotes and UserPostVotes.
2. Store the votes in one table:
UserVotes (id, user_id, item_id, item_type, vote),
where item_id and item_type are the id and type of the question, answer or post, and vote = -1/1.
If I go the first way, I will have at least 9 tables.
If I go the second way, all the data sits in one heap, so in the future, as the table fills up, it will work more slowly.
Which way is more efficient in my case?
If you're looking for my opinion, I would pick door #1. Questions, Answers, and Posts are all separate, albeit related, "things." And, each of these "things" happen to also have "votes" associated with them ... but, really, a "vote" is not a "thing."
A "vote for a question" is tightly associated with "the question." "A vote for ..." anything else is the same. So now I start thinking about the queries I'm most likely to actually write. I'm most likely to want to write queries that, say, count how many votes a particular question has ... and I don't really want to muddy-up that query and make it either "hard to write" or obliged to look through a bunch of records that are not "votes for questions." The other types of votes wouldn't be relevant and I'd rather not have to filter them out. (If I need to write a query to count "how many votes for anything has this user cast?", I could very easily write that regardless.)
That's my opinion. (The database manager can take care of "efficiency" on its own. Design your database so that the queries you need to write are easy and clear to write.)
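For comparison, here is roughly what "count the votes for one question" looks like under each layout. This is only a sketch using Python's sqlite3 in place of MySQL; the key columns of UserQuestionVotes and the sample data are my own assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Layout 1: one table per kind of item (only the question table is shown)
    CREATE TABLE UserQuestionVotes (
        user_id INTEGER NOT NULL,
        question_id INTEGER NOT NULL,
        vote INTEGER NOT NULL,
        PRIMARY KEY (user_id, question_id)
    );
    -- Layout 2: one shared table with a type discriminator
    CREATE TABLE UserVotes (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        item_id INTEGER NOT NULL,
        item_type TEXT NOT NULL,
        vote INTEGER NOT NULL
    );
    INSERT INTO UserQuestionVotes VALUES (1, 7, 1), (2, 7, 1), (3, 7, -1);
    INSERT INTO UserVotes (user_id, item_id, item_type, vote) VALUES
        (1, 7, 'question', 1), (2, 7, 'question', 1), (3, 7, 'question', -1),
        (1, 7, 'answer', 1);
""")

# Layout 1: every row in the table is already a vote for a question.
score_separate = conn.execute(
    "SELECT COALESCE(SUM(vote), 0) FROM UserQuestionVotes WHERE question_id = ?",
    (7,),
).fetchone()[0]

# Layout 2: the other item types have to be filtered out on every query.
score_single = conn.execute(
    "SELECT COALESCE(SUM(vote), 0) FROM UserVotes"
    " WHERE item_type = 'question' AND item_id = ?",
    (7,),
).fetchone()[0]

print(score_separate, score_single)  # 1 1
```

Both give the same answer; the difference is that the single-table version silently returns a wrong score if the item_type filter is ever forgotten.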

Extremely basic SQL Misunderstanding

I'm preparing for an exam in databases and SQL and I'm solving an exercise:
We have a database of 4 tables that represent a human resources company. The tables are:
applicant(a-id,a-name,a-city,years-of-study),
job(job-name,job-id),
qualified(a-id,job-id)
wish(a-id,job-id).
The table applicant obviously holds the applicants, and job is the table of available jobs. The table qualified shows which jobs a person is qualified for, and the table wish shows which jobs a person is interested in.
The question was to write a query that displays, for each job-id, the number of applicants that are both qualified for it and interested in working in it.
Here is the solution the teacher wrote:
Select q1.job_id
, count(q1.a_id)
from qualified as q1
, wish as w1
Where q1.a_id = w1.a_id
and q1.job_id = w1.job_id
Group by job_id;
That's all well and good. I'm not sure why we needed that "as q1" and "as w1", but I can see why it works.
And here is the solution I wrote:
SELECT job-id,COUNT(a-id) FROM job,qualified,wish WHERE (qualified.a-id=wish.a-id)
GROUP BY job-id
Why is my solution wrong? Also, from which table will it select the information? Suppose I write SELECT job-id FROM job,qualified,wish. From which table will it take the information, given that job-id exists in all 3 of these tables?
You can only refer to tables mentioned in the FROM clause. If it's ambiguous (because more than one has a column of the same name) then you need to be explicit by qualifying the name. Usually the qualifier is an alias but it could also be the table name itself if an alias wasn't specified.
There's a concept of a "natural join" which joins tables on common column(s) between two tables. Not all systems support that notation but I think MySQL does. I believe these systems usually collapse the joined pairs into a single column.
select q1.job_id, count(q1.a_id) from qualified as q1, wish as w1
where q1.a_id = w1.a_id and q1.job_id = w1.job_id
group by job_id;
I don't think I've worked on any systems that would have accepted the query above because the grouping column would have been strictly unclear even though the intention really is not. So if it truly does work correctly on MySQL then my guess is that it recognizes the equivalence of the columns and cuts you some slack on the syntax.
By the way, your query appears to be incorrect because you only included a single column in a join that requires two columns. You also included a third table with no join condition, which means that your result will effectively be cross joined with every row in that table. The grouping is still going to reduce it to one row per job_id, but the count is going to be multiplied by the number of rows in the job table. Perhaps you added that table thinking it wouldn't hurt to add it just in case you need it, but that is not what it means at all.
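The multiplication effect is easy to see on a tiny data set. The sketch below uses Python's sqlite3 rather than MySQL, writes the column names with underscores, qualifies every column (otherwise the query is rejected as ambiguous), and uses sample rows I made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE job (job_id INTEGER PRIMARY KEY, job_name TEXT);
    CREATE TABLE qualified (a_id INTEGER, job_id INTEGER);
    CREATE TABLE wish (a_id INTEGER, job_id INTEGER);
    INSERT INTO job VALUES (100, 'dev'), (200, 'qa'), (300, 'ops');
    INSERT INTO qualified VALUES (1, 100), (2, 100), (1, 200);
    INSERT INTO wish VALUES (1, 100), (2, 100), (2, 200);
""")

# Teacher's query: join on BOTH columns, two tables only.
teacher_rows = conn.execute("""
    SELECT q1.job_id, COUNT(q1.a_id)
    FROM qualified AS q1, wish AS w1
    WHERE q1.a_id = w1.a_id AND q1.job_id = w1.job_id
    GROUP BY q1.job_id
    ORDER BY q1.job_id
""").fetchall()

# Student's query: join on a_id only, plus an unconstrained third table.
student_rows = conn.execute("""
    SELECT qualified.job_id, COUNT(qualified.a_id)
    FROM job, qualified, wish
    WHERE qualified.a_id = wish.a_id
    GROUP BY qualified.job_id
    ORDER BY qualified.job_id
""").fetchall()

print(teacher_rows)  # [(100, 2)]
print(student_rows)  # [(100, 9), (200, 3)]
```

Job 100 has two applicants who are both qualified and interested. The second query pairs each qualified row with every wish of the same applicant (3 pairs for job 100, 1 for job 200) and then repeats each pair once per row of job, hence 9 and 3.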
Your query will list non-existing jobs in case the database has orphan records in applicant and qualified, and might also omit jobs that have no qualified and willing candidates.
I'm not exactly sure, because I have no idea if there's any database that will accept COUNT(a-id) when there's no information about the table from which to take this value.
edit: Interestingly, it looks like both of these problems are shared by both of the solutions, but shawnt00 has a point: your solution makes a huge, pointless Cartesian product of three tables. See it without the group by.
My current best guess for a working answer would therefore be http://sqlfiddle.com/#!9/09d0c/6

Best way to do a query with a large number of possible joins

On the project I'm working on we have an activity table and each activity can be linked to one of about 20 different "activity details" tables...
e.g. If the activity was of type "work", then it would have a corresponding activity_details_work record, if it was of type "sick leave" then it would have a corresponding activity_details_sickleave record and so on.
Currently we are loading the activities and then for each activity we have a separate query to go fetch the activity details from the relevant table. This obviously doesn't scale well if you have thousands of activities.
So my initial thought was to have a single query which fetches the activities and joins the details in one go e.g.
SELECT * FROM activity
LEFT JOIN activity_details_1_work ON ...
LEFT JOIN activity_details_2_sickleave ON ...
LEFT JOIN activity_details_3_travelwork ON ...
...etc...
LEFT JOIN activity_details_20_yearleave ON ...
But this will result in each record having hundreds of fields, most of which are empty, and that feels nasty.
Lazy-loading the details isn't really an option either as the details are almost always requested in the core logic, at least for the main types anyway.
Is there a super clever way of doing this that I'm not thinking of?
Thanks in advance
My suggestion is to define a view for each ActivityType, that is tailored specifically to that activity.
Then add an index on the Activity table led by the ActivityType field. Cluster said index unless there is an overwhelming need for some other index to be clustered (or performance benchmarking shows some other clustering selection to perform better).
Is there a particular reason why this degree of denormalization was designed in? Is that reason well known?
Chances are your activity tables are like (date_from, date_to, with_who, descr) or something to that effect. As Pieter suggested, consider tossing in a type varchar or enum field in there, so as to deal with a single details table.
If there are rational reasons to keep the tables apart, consider adding triggers that maintain boolean/tinyint fields (has_work, has_sickleave, etc.), or a bit string (has_activities_of_type, where the first position amounts to has_work, the next to has_sickleave, etc.).
Either way, you'll probably be better off by fetching the activity's details in one or more separate queries -- if only to avoid field name collisions.
I don't think enum is the way to go, because as you say there might be 1000's of activities, then altering your activity table would become an issue.
There is no point doing a left join on a large number of tables either.
So the options that you have are:
See this. The first comment might be useful.
I am guessing that your activity table has a field called activity_type_id.
Build a table called activity_types containing the fields activity_type_id, activity_name and activity_details_table_name. First query in the following way:
SELECT activity.*, activity_types.activity_details_table_name
FROM activity
INNER JOIN activity_types
USING (activity_type_id)
This query gives you the table name on which to query for the details.
This way you can add any new activity type just by adding a row in the activity_types table.
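One way the lookup-table idea could be used to avoid a query per activity is to group the activity ids by details table and then run one query per table. The sketch below does this with Python's sqlite3 instead of MySQL; the activity_id key and the hours and note columns are assumptions, since the real schema was not posted.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE activity_types (
        activity_type_id INTEGER PRIMARY KEY,
        activity_name TEXT,
        activity_details_table_name TEXT
    );
    CREATE TABLE activity (
        activity_id INTEGER PRIMARY KEY,
        activity_type_id INTEGER REFERENCES activity_types(activity_type_id)
    );
    -- hypothetical details tables, cut down to one column each
    CREATE TABLE activity_details_work (activity_id INTEGER PRIMARY KEY, hours REAL);
    CREATE TABLE activity_details_sickleave (activity_id INTEGER PRIMARY KEY, note TEXT);
    INSERT INTO activity_types VALUES
        (1, 'work', 'activity_details_work'),
        (2, 'sick leave', 'activity_details_sickleave');
    INSERT INTO activity VALUES (1, 1), (2, 2), (3, 1);
    INSERT INTO activity_details_work VALUES (1, 8.0), (3, 4.5);
    INSERT INTO activity_details_sickleave VALUES (2, 'flu');
""")

# Step 1: one query tells us which details table each activity lives in.
ids_by_table = defaultdict(list)
for activity_id, table_name in conn.execute("""
    SELECT a.activity_id, t.activity_details_table_name
    FROM activity a
    INNER JOIN activity_types t USING (activity_type_id)
    ORDER BY a.activity_id
"""):
    ids_by_table[table_name].append(activity_id)

# Step 2: one query per details table instead of one per activity.
details = {}
for table_name, ids in ids_by_table.items():
    placeholders = ", ".join("?" * len(ids))
    # Identifiers cannot be bound as parameters; the name comes from our own
    # activity_types table, never from user input.
    query = f'SELECT * FROM "{table_name}" WHERE activity_id IN ({placeholders})'
    for row in conn.execute(query, ids):
        details[row[0]] = row[1:]

print(details)
```

With 20 activity types this is at most 21 queries regardless of how many thousands of activities are loaded, and each result set only carries the columns of its own details table.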

How to efficiently design MySQL database for my particular case

I am developing a forum in PHP MySQL. I want to make my forum as efficient as I can.
I have made these two tables
tbl_threads
tbl_comments
Now, the problem is that there is a Like and a Dislike button under each comment. I have to store the user_name of whoever clicked the Like or Dislike button together with the comment_id. I have made a column user_likes and a column user_dislikes in tbl_comments to store the comma-separated user_names. But on this forum, I have read that this is not an efficient way. I have been advised to create a third table to store the likes and dislikes, so that my database design complies with 1NF.
But the problem is, if I make a third table tbl_user_opinion with two fields like this
1. comment_id
2. type (like or dislike)
will I have to run as many SQL queries as there are comments on my page to get the like and dislike data for each comment? Would that not be inefficient? I think there is some confusion on my part here. Can someone clarify this?
You have a Relational Scheme like this:
There are two ways to solve this. The first one, the "clean" one, is to build your "like" table and do count(*)s on the appropriate column.
The second one would be to store a counter in each comment, indicating how many ups and downs there have been.
If you want to check whether a specific user has voted on a comment, you only have to check one entry, which you can easily handle as its own query, merging the two results outside of your database (for this, use a query returning the comment_id and the kind of vote the user has cast in a specific thread).
Your approach with a comma-separated list does not perform well, because you cannot query it without a huge amount of string parsing. If you have a database - use it!
("One Information - One Dataset"!)
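On the worry about running one query per comment: with a separate opinion table, the counts for every comment on a page can come back from a single grouped query. A minimal sketch, using Python's sqlite3 in place of MySQL; the thread_id and user_name columns are assumed, since the full table definitions were not given.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_comments (
        comment_id INTEGER PRIMARY KEY,
        thread_id INTEGER NOT NULL
    );
    CREATE TABLE tbl_user_opinion (
        comment_id INTEGER NOT NULL REFERENCES tbl_comments(comment_id),
        user_name TEXT NOT NULL,
        type TEXT NOT NULL CHECK (type IN ('like', 'dislike')),
        PRIMARY KEY (comment_id, user_name)  -- one opinion per user per comment
    );
    INSERT INTO tbl_comments VALUES (1, 5), (2, 5), (3, 6);
    INSERT INTO tbl_user_opinion VALUES
        (1, 'ann', 'like'), (1, 'bob', 'like'), (1, 'cy', 'dislike'),
        (3, 'ann', 'like');
""")

# One query returns the like/dislike counts for every comment in thread 5.
rows = conn.execute("""
    SELECT c.comment_id,
           SUM(CASE WHEN o.type = 'like' THEN 1 ELSE 0 END) AS likes,
           SUM(CASE WHEN o.type = 'dislike' THEN 1 ELSE 0 END) AS dislikes
    FROM tbl_comments c
    LEFT JOIN tbl_user_opinion o ON o.comment_id = c.comment_id
    WHERE c.thread_id = ?
    GROUP BY c.comment_id
    ORDER BY c.comment_id
""", (5,)).fetchall()

print(rows)  # [(1, 2, 1), (2, 0, 0)]
```

The LEFT JOIN keeps comments that nobody has voted on, so comment 2 still appears with zero likes and zero dislikes.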
The comma-separated list violates the principle of atomicity, and therefore 1NF. You'll have a hard time maintaining referential integrity and, for the most part, querying as well.
Here is one way to do it in a normalized fashion:
This is very clustering-friendly: it groups up-votes belonging to the same comment physically close together (ditto for down-votes), making the following query rather efficient:
SELECT
COMMENT.COMMENT_ID,
<other COMMENT fields>,
COUNT(DISTINCT UP_VOTE.USER_ID) - COUNT(DISTINCT DOWN_VOTE.USER_ID) SCORE
FROM COMMENT
LEFT JOIN UP_VOTE
ON COMMENT.COMMENT_ID = UP_VOTE.COMMENT_ID
LEFT JOIN DOWN_VOTE
ON COMMENT.COMMENT_ID = DOWN_VOTE.COMMENT_ID
WHERE
COMMENT.COMMENT_ID = <whatever>
GROUP BY
COMMENT.COMMENT_ID,
<other COMMENT fields>;
Please measure on realistic amounts of data whether that works fast enough for you. If not, then denormalize the model and cache the total score in the COMMENT table, keeping it current through triggers every time a row is inserted into or deleted from the *_VOTE tables.
If you also need to get which comments a particular user voted on, you'll need indexes on *_VOTE {USER_ID, COMMENT_ID}, i.e. the reverse of the primary/clustering key above. [1]
[1] This is one of the reasons why I didn't go with just one VOTE table containing an additional field that can be either 1 (for up-vote) or -1 (for down-vote): it's less efficient to cover with secondary indexes.
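The trigger-maintained cached score mentioned in this answer might look something like the sketch below. It is written against SQLite through Python's sqlite3 module, so the trigger syntax differs slightly from MySQL's (which needs FOR EACH ROW), and the SCORE column and trigger names are my own assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- SCORE is the cached (denormalized) total; the column name is an assumption
    CREATE TABLE COMMENT (
        COMMENT_ID INTEGER PRIMARY KEY,
        SCORE INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE UP_VOTE (
        COMMENT_ID INTEGER NOT NULL, USER_ID INTEGER NOT NULL,
        PRIMARY KEY (COMMENT_ID, USER_ID)
    );
    CREATE TABLE DOWN_VOTE (
        COMMENT_ID INTEGER NOT NULL, USER_ID INTEGER NOT NULL,
        PRIMARY KEY (COMMENT_ID, USER_ID)
    );

    CREATE TRIGGER up_vote_insert AFTER INSERT ON UP_VOTE BEGIN
        UPDATE COMMENT SET SCORE = SCORE + 1 WHERE COMMENT_ID = NEW.COMMENT_ID;
    END;
    CREATE TRIGGER up_vote_delete AFTER DELETE ON UP_VOTE BEGIN
        UPDATE COMMENT SET SCORE = SCORE - 1 WHERE COMMENT_ID = OLD.COMMENT_ID;
    END;
    CREATE TRIGGER down_vote_insert AFTER INSERT ON DOWN_VOTE BEGIN
        UPDATE COMMENT SET SCORE = SCORE - 1 WHERE COMMENT_ID = NEW.COMMENT_ID;
    END;
    CREATE TRIGGER down_vote_delete AFTER DELETE ON DOWN_VOTE BEGIN
        UPDATE COMMENT SET SCORE = SCORE + 1 WHERE COMMENT_ID = OLD.COMMENT_ID;
    END;
""")

conn.execute("INSERT INTO COMMENT (COMMENT_ID) VALUES (1)")
conn.executemany("INSERT INTO UP_VOTE VALUES (1, ?)", [(10,), (11,), (12,)])
conn.execute("INSERT INTO DOWN_VOTE VALUES (1, 13)")
conn.execute("DELETE FROM UP_VOTE WHERE COMMENT_ID = 1 AND USER_ID = 12")

# The cached value should match the score computed from the vote tables.
cached_score = conn.execute(
    "SELECT SCORE FROM COMMENT WHERE COMMENT_ID = 1"
).fetchone()[0]
computed_score = conn.execute("""
    SELECT COUNT(DISTINCT u.USER_ID) - COUNT(DISTINCT d.USER_ID)
    FROM COMMENT c
    LEFT JOIN UP_VOTE u ON c.COMMENT_ID = u.COMMENT_ID
    LEFT JOIN DOWN_VOTE d ON c.COMMENT_ID = d.COMMENT_ID
    WHERE c.COMMENT_ID = 1
""").fetchone()[0]

print(cached_score, computed_score)  # 1 1
```

Comparing the cached value against the computed one like this is also a cheap consistency check to run occasionally if you do go down the denormalized route.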

Store Voting Information - Database Outline

Summary: What is the most efficient way to store information similar to the like system on FB? That is, a tally of likes is kept, the person who liked it is kept, etc.
It needs to be associated with a user id so as to know who liked it. The issue is, do you have a column that has a comma delimited list of the id of things that were liked, or do you have a separate column for each like (way too many columns). The info that's stored would be a boolean value (1/0) but needs to be associated with the user as well as the "page" that was liked.
My thought was this:
Column name = likes eg.:
1,2,3,4,5
That is, the user has "liked" the pages that have an id of 1, 2, 3, 4 and 5. To calculate total "likes", a tally would need to be taken and then stored in a database associated with the pages themselves (table already exists).
That seems the best way to me but is there a better option that anyone can think of?
P.S. I'm not doing FB likes but it's the easiest explanation.
EDIT: Similar idea to the plus/neg here on stackoverflow.
In this case the best way would be to create a new table to keep track of the likes. So supposing you have table posts, which has a column post_id which contains all the posts (on which the users can vote). And you have another table users with a column user_id, which contains all the users.
You should create a table likes which has at least two columns, something like like_postid and like_userid. Now, every time a user likes a post, create a new row in this table with the id of the post that is liked (the value of post_id from posts) and the id of the user who likes the post (the value of user_id from users). Of course you can add some more columns to the likes table (for instance to keep track of when a like is created).
What you have here is called a many-to-many relationship. Google it to get some more information about it and to find some more advice on how to implement it correctly (you will find that a comma-separated list of ids is not one of the best practices).
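A minimal sketch of such a likes table, using Python's sqlite3 as a stand-in for MySQL. The composite primary key is my own addition: it is one common way to stop the same user from liking the same post twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY);
    CREATE TABLE posts (post_id INTEGER PRIMARY KEY);
    CREATE TABLE likes (
        like_postid INTEGER NOT NULL REFERENCES posts(post_id),
        like_userid INTEGER NOT NULL REFERENCES users(user_id),
        PRIMARY KEY (like_postid, like_userid)  -- one like per user per post
    );
    INSERT INTO users VALUES (1), (2);
    INSERT INTO posts VALUES (1), (2), (3);
    INSERT INTO likes VALUES (1, 1), (2, 1), (1, 2);
""")

# A second like by the same user on the same post is rejected by the key.
try:
    conn.execute("INSERT INTO likes VALUES (1, 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Tally of likes per post, including posts nobody has liked yet.
like_counts = conn.execute("""
    SELECT p.post_id, COUNT(l.like_userid)
    FROM posts p
    LEFT JOIN likes l ON l.like_postid = p.post_id
    GROUP BY p.post_id
    ORDER BY p.post_id
""").fetchall()

# Which posts has user 1 liked?
liked_by_user_1 = [row[0] for row in conn.execute(
    "SELECT like_postid FROM likes WHERE like_userid = ? ORDER BY like_postid",
    (1,),
)]

print(duplicate_rejected, like_counts, liked_by_user_1)
```

Both the tally and the "who liked what" lookup come straight out of the one table, with no string parsing and no stored counter to keep in sync.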
Update based on comments:
If I'm correct, you want to get a list of all users (ordered by name) who have voted on an artist. You should do that with something like:
SELECT Artists.Name, Users.Users_Name
FROM Artists
JOIN Votes
ON Votes.page_ID = Artists.ID
JOIN Users
ON Votes.Votes_Userid = Users.User_ID
WHERE Artists.Name = 'dfgdfg'
ORDER BY Users.Users_Name
There is a strange thing here: the column in your Votes table which contains the artist id seems to be called page_ID. Also, you're a bit inconsistent in column names (not really bad, but something to keep in mind if you want to be able to understand your code after leaving it alone for 6 months). In your comment you say that you only make one join, but you actually do two joins. If you specify two table names (like you do: JOIN Users, Votes), SQL actually joins these two tables.
Based on the query you posted in the comments I can tell you haven't got much experience using joins. I suggest you read up on how to use them, it will really improve your ability to write good code.