Is it efficient to merge similar data objects into a single table? - mysql

I need to store a lot of similar data for my question-and-answer system, such as votes, follows, bookmarks, etc.
Taking voting as an example, what is the best table layout for storing votes for questions, answers, and posts?
1. Store the votes separately, which gives 3 tables: UserQuestionVotes, UserAnswerVotes and UserPostVotes.
2. Store the votes in one table:
UserVotes (id, user_id, item_id, item_type, vote),
where item_id and item_type are the id and type of the question, answer or post, and vote = -1/1.
If I go the first way, I will have at least 9 tables.
If I go the second way, all the data sits in one heap, and I worry that queries will get slower as the table fills up.
Which way is more efficient in my case?

If you're looking for my opinion, I would pick door #1. Questions, Answers, and Posts are all separate, albeit related, "things." And each of these "things" happens to also have "votes" associated with it ... but, really, a "vote" is not a "thing."
A "vote for a question" is tightly associated with "the question." "A vote for ..." anything else is the same. So now I start thinking about the queries I'm most likely to actually write. I'm most likely to want to write queries that, say, count how many votes a particular question has ... and I don't really want to muddy-up that query and make it either "hard to write" or obliged to look through a bunch of records that are not "votes for questions." The other types of votes wouldn't be relevant and I'd rather not have to filter them out. (If I need to write a query to count "how many votes for anything has this user cast?", I could very easily write that regardless.)
That's my opinion. (The database manager can take care of "efficiency" on its own. Design your database so that the queries you need to write are easy and clear to write.)
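To make the comparison concrete, here is a minimal sketch of both layouts, counting the score of one question each way. It uses Python's built-in sqlite3 as a stand-in for MySQL; the composite keys and the sample rows are my own assumptions, not something given in the question.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here (assumption for runnability).
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Option 1: one table per voted-on type (only the question table shown).
CREATE TABLE UserQuestionVotes (
    user_id     INTEGER NOT NULL,
    question_id INTEGER NOT NULL,
    vote        INTEGER NOT NULL CHECK (vote IN (-1, 1)),
    PRIMARY KEY (question_id, user_id)
);
-- Option 2: one merged table with a type discriminator.
CREATE TABLE UserVotes (
    id        INTEGER PRIMARY KEY,
    user_id   INTEGER NOT NULL,
    item_id   INTEGER NOT NULL,
    item_type TEXT    NOT NULL,
    vote      INTEGER NOT NULL CHECK (vote IN (-1, 1)),
    UNIQUE (item_type, item_id, user_id)  -- assumed index, also one vote per user
);
""")

conn.executemany(
    "INSERT INTO UserQuestionVotes (user_id, question_id, vote) VALUES (?, ?, ?)",
    [(1, 10, 1), (2, 10, 1), (3, 10, -1)],
)
conn.executemany(
    "INSERT INTO UserVotes (user_id, item_id, item_type, vote) VALUES (?, ?, ?, ?)",
    [(1, 10, "question", 1), (2, 10, "question", 1),
     (3, 10, "question", -1), (1, 10, "answer", -1)],
)

# Option 1: no type filter needed.
separate_score = conn.execute(
    "SELECT COALESCE(SUM(vote), 0) FROM UserQuestionVotes WHERE question_id = ?",
    (10,),
).fetchone()[0]

# Option 2: every query must also filter on item_type.
merged_score = conn.execute(
    "SELECT COALESCE(SUM(vote), 0) FROM UserVotes "
    "WHERE item_type = 'question' AND item_id = ?",
    (10,),
).fetchone()[0]
```

Both queries return the same score; the difference is that the merged table needs the extra item_type predicate everywhere, and an index that leads with item_type to avoid scanning unrelated votes.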

Related

Is having duplicate Database values better than querying more times?

Consider a table called users and a table called votes.
A user has an id and a country column.
Every vote belongs to a user, but the purpose when retrieving the vote is to find out which country it came from. Therefore you would need to query once to get the vote, and then query the users table to get the country.
Considering a large, frequently queried database, is it better to just add a country column to the votes table and have it duplicate the one in users, or to use the method above?
Yes. No. Maybe.
The answer to your question depends on several things that you don't mention in the question. The first thing to note is that the query in VKP's answer is quite sufficient under most circumstances.
Second, if country is a full country name, then storing the full country name (which can be rather long) may greatly expand the size of the table. This increase in size may actually slow down certain queries, versus doing the join. Of course, this would be much less significant for 2- or 3- character codes or if the width of the records in votes is already several hundred bytes.
But perhaps the most important consideration is whether you want the vote counted for the user's current country, or for the country assigned to the user when the vote was made. The first option says to always use a join to get the current value. The second is a very strong argument for including country in the votes table.
select v.vote_id, u.country
from users u join votes v
on u.id = v.userid
If you need to see the country from which a vote was cast, you can join the tables and get it. Also, I would not suggest including a country column in the votes table, as it doesn't make sense there.
The way you have explained it, country is an attribute of user: user "lives in" or "is citizen of" a country. Vote is an action that may be taken by users: users cast votes.
How is it that you have a vote under consideration without already knowing the user? How was this vote selected in the first place? There must be some other detail(s) that you have omitted.
If you are searching for aggregate values ("How many votes were cast by Canadians during July?") then you have to join the tables anyway -- filtering on users only in Canada and votes only during July. A query for "In which countries did any citizens cast at least one vote in July?" would be trickier to code, but still requires a join.
The join needed by the latter question could be eliminated by duplicating the country to the Votes table. But I don't think any performance improvement would be significant and you must remember that you will have made your database a little more complicated, a little less maintainable and a little less robust. It would have to be quite a large performance boost to make all that worthwhile.
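The "current country versus country at vote time" distinction raised above is easy to see in a small sketch. This uses Python's sqlite3 in place of MySQL, and the country_at_vote column is a hypothetical name for the duplicated value; the users and votes columns follow the query shown earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL (assumption)
conn.executescript("""
CREATE TABLE users (
    id      INTEGER PRIMARY KEY,
    country TEXT NOT NULL
);
CREATE TABLE votes (
    vote_id         INTEGER PRIMARY KEY,
    userid          INTEGER NOT NULL REFERENCES users(id),
    country_at_vote TEXT NOT NULL   -- hypothetical snapshot column
);
INSERT INTO users VALUES (1, 'CA');
INSERT INTO votes VALUES (100, 1, 'CA');
-- The user moves after voting.
UPDATE users SET country = 'FR' WHERE id = 1;
""")

# Join: always reports the user's current country.
current_country = conn.execute(
    "SELECT u.country FROM votes v JOIN users u ON u.id = v.userid "
    "WHERE v.vote_id = 100"
).fetchone()[0]

# Snapshot column: reports the country recorded when the vote was cast.
snapshot_country = conn.execute(
    "SELECT country_at_vote FROM votes WHERE vote_id = 100"
).fetchone()[0]
```

After the update, the join says FR and the snapshot says CA. Neither is wrong; they answer different questions, which is the real basis for choosing between them.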

How to efficiently design MySQL database for my particular case

I am developing a forum in PHP MySQL. I want to make my forum as efficient as I can.
I have made these two tables
tbl_threads
tbl_comments
Now, the problem is that there is a like and a dislike button under each comment. I have to store the user_name of whoever clicked the Like or Dislike button, along with the comment_id. I have made a column user_likes and a column user_dislikes in tbl_comments to store the comma-separated user_names. But on this forum, I have read that this is not an efficient way. I have been advised to create a third table to store the likes and dislikes and to make my database design comply with 1NF.
But the problem is, If I make a third table tbl_user_opinion and make two fields like this
1. comment_id
2. type (like or dislike)
So, will I have to run as many SQL queries as there are comments on my page to get the like and dislike data for each comment? Would that not be inefficient? I think there is some confusion on my part here. Can someone clarify this?
You have a Relational Scheme like this:
There are two ways to solve this. The first one, the "clean" one is to build your "like" table, and do "count(*)'s" on the appropriate column.
The second one would be to store in each comment a counter, indicating how many up's and down's have been there.
If you want to check whether a specific user has voted on a comment, you only have to check one entry, which you can easily handle as its own query and merge the two results outside of your database (for this, use a query that returns the comment_id and the kind of vote the user has cast in a specific thread).
Your approach with a comma-separated list does not perform well, because you cannot query it without extra logic or a huge amount of string parsing. If you have a database, use it!
("One Information - One Dataset"!)
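As for the worry about running one query per comment: with a separate opinion table, a single grouped query can return the counts for every comment in a thread. Below is a rough sketch using Python's sqlite3 as a stand-in for MySQL. The thread_id and user_name columns, the primary key and the sample data are assumptions on my part.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL (assumption)
conn.executescript("""
CREATE TABLE tbl_comments (
    comment_id INTEGER PRIMARY KEY,
    thread_id  INTEGER NOT NULL
);
CREATE TABLE tbl_user_opinion (
    comment_id INTEGER NOT NULL REFERENCES tbl_comments(comment_id),
    user_name  TEXT    NOT NULL,
    type       TEXT    NOT NULL CHECK (type IN ('like', 'dislike')),
    PRIMARY KEY (comment_id, user_name)  -- one opinion per user per comment
);
INSERT INTO tbl_comments VALUES (1, 7), (2, 7), (3, 7);
INSERT INTO tbl_user_opinion VALUES
    (1, 'ann', 'like'), (1, 'bob', 'like'), (1, 'cy', 'dislike'),
    (2, 'ann', 'dislike');
""")

# One query for the whole page: counts per comment in thread 7.
counts = conn.execute("""
    SELECT c.comment_id,
           SUM(CASE WHEN o.type = 'like'    THEN 1 ELSE 0 END) AS likes,
           SUM(CASE WHEN o.type = 'dislike' THEN 1 ELSE 0 END) AS dislikes
    FROM tbl_comments c
    LEFT JOIN tbl_user_opinion o ON o.comment_id = c.comment_id
    WHERE c.thread_id = ?
    GROUP BY c.comment_id
    ORDER BY c.comment_id
""", (7,)).fetchall()

# A second query: what the current user ('ann') voted in this thread.
own_votes = dict(conn.execute("""
    SELECT o.comment_id, o.type
    FROM tbl_user_opinion o
    JOIN tbl_comments c ON c.comment_id = o.comment_id
    WHERE c.thread_id = ? AND o.user_name = ?
""", (7, "ann")).fetchall())
```

So the page needs two queries in total, regardless of how many comments it shows, and the two results can be merged in application code.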
The comma-separated list violates the principle of atomicity, and therefore 1NF. You'll have a hard time maintaining referential integrity and, for the most part, querying as well.
Here is one way to do it in a normalized fashion:
This is very clustering-friendly: it groups up-votes belonging to the same comment physically close together (ditto for down-votes), making the following query rather efficient:
SELECT
    COMMENT.COMMENT_ID,
    <other COMMENT fields>,
    COUNT(DISTINCT UP_VOTE.USER_ID) - COUNT(DISTINCT DOWN_VOTE.USER_ID) SCORE
FROM COMMENT
    LEFT JOIN UP_VOTE
        ON COMMENT.COMMENT_ID = UP_VOTE.COMMENT_ID
    LEFT JOIN DOWN_VOTE
        ON COMMENT.COMMENT_ID = DOWN_VOTE.COMMENT_ID
WHERE
    COMMENT.COMMENT_ID = <whatever>
GROUP BY
    COMMENT.COMMENT_ID,
    <other COMMENT fields>;
Please measure on realistic amounts of data whether that works fast enough for you. If not, then denormalize the model and cache the total score in the COMMENT table, and keep it current through triggers every time a new row is inserted into or deleted from the *_VOTE tables.
If you also need to get which comments a particular user voted on, you'll need indexes on *_VOTE {USER_ID, COMMENT_ID}, i.e. the reverse of the primary/clustering key above. [1]
[1] This is one of the reasons why I didn't go with just one VOTE table containing an additional field that can be either 1 (for up-vote) or -1 (for down-vote): it's less efficient to cover with secondary indexes.
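For the denormalized fallback (a cached score kept current by triggers), a minimal sketch could look like the following. It is written against Python's sqlite3, so the trigger syntax is SQLite's and would need adapting for MySQL. The table columns are inferred from the query above, and the SCORE column and trigger names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL (assumption)
conn.executescript("""
CREATE TABLE COMMENT (
    COMMENT_ID INTEGER PRIMARY KEY,
    SCORE      INTEGER NOT NULL DEFAULT 0   -- hypothetical cached total
);
CREATE TABLE UP_VOTE (
    COMMENT_ID INTEGER NOT NULL REFERENCES COMMENT(COMMENT_ID),
    USER_ID    INTEGER NOT NULL,
    PRIMARY KEY (COMMENT_ID, USER_ID)
);
CREATE TABLE DOWN_VOTE (
    COMMENT_ID INTEGER NOT NULL REFERENCES COMMENT(COMMENT_ID),
    USER_ID    INTEGER NOT NULL,
    PRIMARY KEY (COMMENT_ID, USER_ID)
);
-- Reverse indexes for "which comments did this user vote on?"
CREATE INDEX UP_VOTE_BY_USER   ON UP_VOTE   (USER_ID, COMMENT_ID);
CREATE INDEX DOWN_VOTE_BY_USER ON DOWN_VOTE (USER_ID, COMMENT_ID);

CREATE TRIGGER up_vote_ins AFTER INSERT ON UP_VOTE BEGIN
    UPDATE COMMENT SET SCORE = SCORE + 1 WHERE COMMENT_ID = NEW.COMMENT_ID;
END;
CREATE TRIGGER up_vote_del AFTER DELETE ON UP_VOTE BEGIN
    UPDATE COMMENT SET SCORE = SCORE - 1 WHERE COMMENT_ID = OLD.COMMENT_ID;
END;
CREATE TRIGGER down_vote_ins AFTER INSERT ON DOWN_VOTE BEGIN
    UPDATE COMMENT SET SCORE = SCORE - 1 WHERE COMMENT_ID = NEW.COMMENT_ID;
END;
CREATE TRIGGER down_vote_del AFTER DELETE ON DOWN_VOTE BEGIN
    UPDATE COMMENT SET SCORE = SCORE + 1 WHERE COMMENT_ID = OLD.COMMENT_ID;
END;
""")

conn.execute("INSERT INTO COMMENT (COMMENT_ID) VALUES (1)")
conn.executemany("INSERT INTO UP_VOTE VALUES (1, ?)", [(1,), (2,), (3,)])
conn.execute("INSERT INTO DOWN_VOTE VALUES (1, 4)")
conn.execute("DELETE FROM UP_VOTE WHERE COMMENT_ID = 1 AND USER_ID = 3")

cached_score = conn.execute(
    "SELECT SCORE FROM COMMENT WHERE COMMENT_ID = 1"
).fetchone()[0]

# Cross-check the cache against the normalized tables.
computed_score = conn.execute("""
    SELECT (SELECT COUNT(*) FROM UP_VOTE   WHERE COMMENT_ID = 1)
         - (SELECT COUNT(*) FROM DOWN_VOTE WHERE COMMENT_ID = 1)
""").fetchone()[0]
```

The cross-check at the end is worth keeping around as a periodic sanity query, since a cached value is only as trustworthy as the triggers that maintain it.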

How to structure a categorized voting database?

I have a question about how to structure a DB. I have a reddit'esque voting system. Items can get votes. But each item belongs to a topic and each topic a category. While only items can get votes I'd like to be able to access the # of votes within a topic and within a category as well. Any suggestions on how to accomplish this?
I see 4 main ways of doing this:
1. De-normalize the votes and store the votes inside an attribute in the item table, the topic table, and the category table. I would then need to update all 3 whenever a vote / downvote occurs.
2. Create a separate 'vote' model. Votes belong to items, items to topics, and topics to categories. Then I can just query number of votes through the chain whenever I need to access anything.
3. Just have items and votes. Items would have a category and topic attribute, then I'd query for items within a topic and count the votes on them.
4. Learn to use a NoSQL db system.
Extra info: I'm using Rails and I only really know MySQL at the moment. Is this a time I should learn something like Mongo? Can this only really be accomplished with Hadoop? Can I accomplish this in MySQL? Thanks!
"Create a separate 'vote' model. Votes belong to items, items to topics, and topics to categories. Then I can just query number of votes through the chain whenever I need to access anything."
That's the most flexible way to do what you're talking about.
"Learn to use a NoSQL db system."
Not for your current project.
"Is this a time I should learn something like Mongo?"
No.
"Can this only really be accomplished with Hadoop?"
No. Any SQL database can do this. Whether any SQL database can manage whatever you're planning is a different question. Different platforms scale differently.
"Can I accomplish this in MySQL?"
Yes, easily.
I think you should go for option 2.
You need to create a vote model anyway, since you'll probably want to limit users to one vote on each item.
If you have performance issues later on, you can always cache the number of votes in an item, topic or category.
How you update those numbers should be carefully considered. A trigger on votes that auto-updates all the numbers above might cause too many write operations. Another way may be to run a statistics stored procedure periodically.
Anyway, the real point is - don't optimize until you know there's a problem.
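For what it's worth, here is roughly what option 2 looks like when you "query through the chain". This is a sketch using Python's sqlite3 rather than MySQL or Rails models, and every table and column name in it is an assumption for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL (assumption)
conn.executescript("""
CREATE TABLE categories (id INTEGER PRIMARY KEY);
CREATE TABLE topics (
    id          INTEGER PRIMARY KEY,
    category_id INTEGER NOT NULL REFERENCES categories(id)
);
CREATE TABLE items (
    id       INTEGER PRIMARY KEY,
    topic_id INTEGER NOT NULL REFERENCES topics(id)
);
CREATE TABLE votes (
    item_id INTEGER NOT NULL REFERENCES items(id),
    user_id INTEGER NOT NULL,
    PRIMARY KEY (item_id, user_id)   -- limits each user to one vote per item
);
INSERT INTO categories VALUES (1);
INSERT INTO topics VALUES (1, 1), (2, 1);
INSERT INTO items  VALUES (1, 1), (2, 1), (3, 2);
INSERT INTO votes  VALUES (1, 1), (1, 2), (2, 1), (3, 1);
""")

# Votes per topic: votes -> items -> topics.
votes_per_topic = conn.execute("""
    SELECT t.id, COUNT(v.user_id)
    FROM topics t
    LEFT JOIN items i ON i.topic_id = t.id
    LEFT JOIN votes v ON v.item_id = i.id
    GROUP BY t.id
    ORDER BY t.id
""").fetchall()

# Votes per category: one more join up the chain.
votes_per_category = conn.execute("""
    SELECT c.id, COUNT(v.user_id)
    FROM categories c
    LEFT JOIN topics t ON t.category_id = c.id
    LEFT JOIN items  i ON i.topic_id = t.id
    LEFT JOIN votes  v ON v.item_id = i.id
    GROUP BY c.id
    ORDER BY c.id
""").fetchall()

# The primary key rejects a second vote by the same user on the same item.
try:
    conn.execute("INSERT INTO votes (item_id, user_id) VALUES (1, 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

If these joins ever turn out to be too slow on real data, the cached counters mentioned above can be added later without changing this structure.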

Implementing Comments and Likes in database

I'm a software developer. I love to code, but I hate databases... Currently, I'm creating a website on which a user will be allowed to mark an entity as liked (like in FB), tag it and comment.
I got stuck on the database table design for handling this functionality. The solution is trivial if we only do this for one type of thing (e.g. photos). But I need to enable this for 5 different things (for now, and I also assume that this number can grow as the whole service grows).
I found some similar questions here, but none of them have a satisfying answer, so I'm asking this question again.
The question is how to design the database properly, efficiently and flexibly, so that it can store comments, likes and tags for several different tables. Some design pattern as an answer would be best ;)
Detailed description:
I have a table User with some user data, and 3 more tables: Photo with photographs, Articles with articles, Places with places. I want to enable any logged-in user to:
comment on any of those 3 tables
mark any of them as liked
tag any of them with some tag
I also want to count the number of likes for every element and the number of times that particular tag was used.
1st approach:
a) For tags, I will create a table Tag [TagId, tagName, tagCounter], then I will create many-to-many relationships tables for: Photo_has_tags, Place_has_tag, Article_has_tag.
b) The same goes for comments.
c) I will create tables LikedPhotos [idUser, idPhoto], LikedArticles [idUser, idArticle], LikedPlace [idUser, idPlace]. The number of likes will be calculated by queries (which, I assume, is bad). And...
I really don't like this design because of the last part; it smells bad to me ;)
2nd approach:
I will create a table ElementType [idType, TypeName == some table name] which will be populated by the administrator (me) with the names of tables that can be liked, commented or tagged. Then I will create tables:
a) LikedElement [idLike, idUser, idElementType, idLikedElement] and the same for Comments and Tags with the proper columns for each. Now, when I want to make a photo liked I will insert:
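SET @typeId = (SELECT idType FROM ElementType WHERE TypeName = 'Photo');
INSERT INTO LikedElement (idUser, idElementType, idLikedElement) VALUES (@userId, @typeId, @photoId);
and for places:
SET @typeId = (SELECT idType FROM ElementType WHERE TypeName = 'Place');
INSERT INTO LikedElement (idUser, idElementType, idLikedElement) VALUES (@userId, @typeId, @placeId);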
and so on... I think that the second approach is better, but I also feel like something is missing in this design as well...
Finally, I also wonder where the best place is to store the counter for how many times an element was liked. I can think of only two ways:
in the element (Photo/Article/Place) table
by select count().
I hope that my explanation of the issue is more thorough now.
The most extensible solution is to have just one "base" table (connected to "likes", tags and comments), and "inherit" all other tables from it. Adding a new kind of entity involves just adding a new "inherited" table - it then automatically plugs into the whole like/tag/comment machinery.
Entity-relationship term for this is "category" (see the ERwin Methods Guide, section: "Subtype Relationships"). The category symbol is:
Assuming a user can like multiple entities, the same tag can be used for more than one entity, but a comment is entity-specific, your model could look like this:
BTW, there are roughly 3 ways to implement the "ER category":
1. All types in one table.
2. All concrete types in separate tables.
3. All concrete and abstract types in separate tables.
Unless you have very stringent performance requirements, the third approach is probably the best (meaning the physical tables match 1:1 the entities in the diagram above).
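Since the diagram is not reproduced here, a rough sketch of the third approach (a base table plus one table per concrete type) may help. It uses Python's sqlite3 in place of MySQL, and all table names, column names and the helper function are hypothetical, not the ones from the original model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL (assumption)
conn.executescript("""
-- Abstract base type: everything that can be liked, tagged or commented.
CREATE TABLE Entity (
    entity_id INTEGER PRIMARY KEY
);
-- Concrete types share the base table's key ("inherit" from it).
CREATE TABLE Photo (
    entity_id INTEGER PRIMARY KEY REFERENCES Entity(entity_id),
    url       TEXT NOT NULL
);
CREATE TABLE Article (
    entity_id INTEGER PRIMARY KEY REFERENCES Entity(entity_id),
    title     TEXT NOT NULL
);
-- Likes reference only the base table, so they work for every type.
CREATE TABLE EntityLike (
    user_id   INTEGER NOT NULL,
    entity_id INTEGER NOT NULL REFERENCES Entity(entity_id),
    PRIMARY KEY (user_id, entity_id)
);
""")


def add_entity(conn, table, column, value):
    """Insert a base row, then the subtype row sharing its key.

    `table` and `column` come from trusted code here, never from user input.
    """
    entity_id = conn.execute("INSERT INTO Entity DEFAULT VALUES").lastrowid
    conn.execute(
        f"INSERT INTO {table} (entity_id, {column}) VALUES (?, ?)",
        (entity_id, value),
    )
    return entity_id


photo_id = add_entity(conn, "Photo", "url", "a.jpg")
article_id = add_entity(conn, "Article", "title", "Hello")

conn.executemany(
    "INSERT INTO EntityLike (user_id, entity_id) VALUES (?, ?)",
    [(1, photo_id), (2, photo_id), (1, article_id)],
)


def like_count(conn, entity_id):
    """Count likes for any entity, whatever its concrete type."""
    return conn.execute(
        "SELECT COUNT(*) FROM EntityLike WHERE entity_id = ?", (entity_id,)
    ).fetchone()[0]


photo_likes = like_count(conn, photo_id)
article_likes = like_count(conn, article_id)
```

Adding a new likeable type (say, Place) would then mean adding one more subtype table that references Entity; the like, tag and comment tables stay untouched.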
Since you "hate" databases, why are you trying to implement one? Instead, solicit help from someone who loves and breathes this stuff.
Otherwise, learn to love your database. A well designed database simplifies programming, engineering the site, and smooths its continuing operation. Even an experienced d/b designer will not have complete and perfect foresight: some schema changes down the road will be needed as usage patterns emerge or requirements change.
If this is a one man project, program the database interface into simple operations using stored procedures: add_user, update_user, add_comment, add_like, upload_photo, list_comments, etc. Do not embed the schema into even one line of code. In this manner, the database schema can be changed without affecting any code: only the stored procedures should know about the schema.
You may have to refactor the schema several times. This is normal. Don't worry about getting it perfect the first time. Just make it functional enough to prototype an initial design. If you have the luxury of time, use it some, and then delete the schema and do it again. It is always better the second time.
This is a general idea
Please don't pay much attention to the styling of the field names, but more to the relations and structure.
This pseudocode will get all the comments of photo with ID 5
SELECT * FROM actions
WHERE actions.id_Stuff = 5
AND actions.typeStuff = 'photo'
AND actions.typeAction = 'comment';
This pseudocode will get all the likes or users who liked photo with ID 5
(you may use count() to just get the amount of likes)
SELECT * FROM actions
WHERE actions.id_Stuff = 5
AND actions.typeStuff = 'photo'
AND actions.typeAction = 'like';
As far as I understand, several tables are required, with a many-to-many relation between them:
A table which stores the user data, such as name, surname and birth date, with an identity field.
A table which stores the data types. These types may be photos, shares, links. Each type must have its own table, so there is a relation between those individual tables and this table.
A separate table for each data type, for example status updates, photos, links.
A last table for the many-to-many relation, storing an id, user id, data type and data id.
Look at the access patterns you are going to need. Do any of them seem to be made particularly difficult or inefficient by one design choice or the other?
If not, favour the one that requires fewer tables.
In this case:
Add Comment: you either pick a particular many-to-many table or insert into a common table with a known specific identifier for what is being liked. I think client code will be slightly simpler in your second case.
Find comments for item: here it seems using a common table is slightly easier, since we just have a single query parameterised by type of entity.
Find comments by a person about one kind of thing: simple query in either case.
Find all comments by a person about all things: this seems a little gnarly either way.
I think your "discriminated" approach, option 2, yields simpler queries in some cases and doesn't seem much worse in the others so I'd go with it.
Consider using a table per entity for comments, etc. More tables mean better sharding and scaling. Controlling many similar tables is not a problem in any framework I know.
One day you'll need to optimize reads from such a structure. You can easily create aggregating tables over the base ones and lose a bit on writes.
One big table with a dictionary may become uncontrollable one day.
Definitely go with the second approach, where you have one table and store the element type for each row; it will give you a lot more flexibility. Basically, when something can logically be done with fewer tables, it is almost always better to go with fewer tables. Two advantages come to mind for your particular case. First, suppose you want to delete all liked elements of a certain user: with the first approach you need to issue one query for each element type, but with the second approach it can be done with only one query. Second, suppose you want to add a new element type: with the first approach you have to create a new table for each new type, but with the second approach you don't need any new tables...
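A small sketch of the two advantages just mentioned, using the ElementType and LikedElement tables from the question. Python's sqlite3 stands in for MySQL here, and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL (assumption)
conn.executescript("""
CREATE TABLE ElementType (
    idType   INTEGER PRIMARY KEY,
    TypeName TEXT NOT NULL UNIQUE
);
CREATE TABLE LikedElement (
    idLike         INTEGER PRIMARY KEY,
    idUser         INTEGER NOT NULL,
    idElementType  INTEGER NOT NULL REFERENCES ElementType(idType),
    idLikedElement INTEGER NOT NULL,
    UNIQUE (idUser, idElementType, idLikedElement)
);
INSERT INTO ElementType (TypeName) VALUES ('Photo'), ('Place');
""")


def like(conn, user_id, type_name, element_id):
    """Record a like; the type is looked up by name in ElementType."""
    conn.execute(
        "INSERT INTO LikedElement (idUser, idElementType, idLikedElement) "
        "SELECT ?, idType, ? FROM ElementType WHERE TypeName = ?",
        (user_id, element_id, type_name),
    )


like(conn, 1, "Photo", 5)
like(conn, 1, "Place", 9)
like(conn, 2, "Photo", 5)

# Adding a new likeable type is one row, not a new table.
conn.execute("INSERT INTO ElementType (TypeName) VALUES ('Article')")
like(conn, 1, "Article", 3)

# Deleting everything user 1 liked is one statement across all types.
deleted = conn.execute("DELETE FROM LikedElement WHERE idUser = 1").rowcount
remaining = conn.execute("SELECT COUNT(*) FROM LikedElement").fetchone()[0]
```

One caveat with this layout: idLikedElement cannot carry a real foreign key, because it points into a different table depending on the row's type, so referential integrity has to be enforced elsewhere.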

Guidelines for join/link/many to many tables

I have my own theories on the best way to do this, but I think it's a common topic and I'd be interested in the different methods people use. Here goes.
What's the best way to deal with many-to-many join tables, particularly as far as naming them goes, what to do when you need to add extra information to the relationship, and what to do when there are multiple relationships between two tables?
Let's say you have two tables, Users and Events, and need to store the attendees. So you create an EventAttendees table. Then a requirement comes up to store the organisers. Should you
create an EventOrganisers table, so each new relationship is modelled with a join table
or
rename EventAttendees to UserEventRelationship (or some other name, like User2Event or UserEventMap or UserToEvent), and add an IsAttending column and an IsOrganiser column, i.e. you have a single table in which you store all relationship info between the two tables
or
a bit of both (really?)
or
something else entirely?
Thoughts?
The easy answer to a generic question like this is, as always, "It all depends on the details".
But in general, I try to create fewer tables when this can be done without abusing the data definitions unduly. So in your example, I would probably add an isOrganizer column to the table, or maybe an attendeeType to allow for easy future expansion from audience/organizer to audience/organizer/speaker/caterer or whatever may be needed. Creating an extra table with essentially identical columns, where the table name is in effect a flag identifying the "attendee type", seems to me the wrong way to go both from a pristine design perspective and also from a practical point of view.
A single table is more flexible. With one table and a type field, if we want to know just the organizers -- like when we're sending invitations to a planning meeting -- fine, we write "select userid from userevent where eventid=? and attendeetype='O'". If we want to know everyone who will be there -- like when we're printing name cards for the lunch tables -- we just don't include the attendeetype test.
But suppose we have two tables. Then if we want just the organizers, okay, that's easy, join on the organizer table. But if we want both organizers and audience, then we have to do a union, which makes for more complicated queries and is usually slow. And if you're thinking, What's the big deal doing a union?, note that there may be more to the query. Perhaps a person can have multiple phone numbers and we care about this, so the query is not just joining user and eventAttendee but also phone. Maybe we want to know if they've attended previous conferences because we give special deals to "alumni", so we have to join in eventAttendee a second time, etc etc. A ten-table join with a union can get very messy and confusing to read.
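To put the two designs side by side, here is a minimal sketch using Python's sqlite3 as a stand-in for whatever database you use. The userevent table and its attendeetype codes follow the query quoted above; the two-table variant and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Single join table with a type column.
CREATE TABLE userevent (
    userid       INTEGER NOT NULL,
    eventid      INTEGER NOT NULL,
    attendeetype TEXT    NOT NULL CHECK (attendeetype IN ('O', 'A')),
    PRIMARY KEY (userid, eventid)
);
INSERT INTO userevent VALUES (1, 1, 'O'), (2, 1, 'A'), (3, 1, 'A');

-- Alternative: one join table per relationship.
CREATE TABLE EventOrganisers (
    userid  INTEGER NOT NULL,
    eventid INTEGER NOT NULL,
    PRIMARY KEY (userid, eventid)
);
CREATE TABLE EventAttendees (
    userid  INTEGER NOT NULL,
    eventid INTEGER NOT NULL,
    PRIMARY KEY (userid, eventid)
);
INSERT INTO EventOrganisers VALUES (1, 1);
INSERT INTO EventAttendees  VALUES (2, 1), (3, 1);
""")

# Single table: organizers only.
organizers = [row[0] for row in conn.execute(
    "SELECT userid FROM userevent "
    "WHERE eventid = ? AND attendeetype = 'O' ORDER BY userid", (1,)
)]

# Single table: everyone, by simply dropping the attendeetype test.
everyone_single = [row[0] for row in conn.execute(
    "SELECT userid FROM userevent WHERE eventid = ? ORDER BY userid", (1,)
)]

# Two tables: everyone requires a UNION.
everyone_union = [row[0] for row in conn.execute(
    "SELECT userid FROM EventOrganisers WHERE eventid = ? "
    "UNION "
    "SELECT userid FROM EventAttendees WHERE eventid = ? "
    "ORDER BY userid", (1, 1)
)]
```

Note that the primary key on (userid, eventid) means a user has exactly one role per event in the single-table version; if someone can be both organizer and attendee, the key would need to include attendeetype.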