MongoDB -- Sub sub children? - mysql

In an app like StackOverflow (nothing really related but it's a good example), we have Questions. Each Question in our mongodb collection has an array of Answers, and an array of Votes (either upvoting or downvoting).
In each Answer, it is possible for a User to upvote or downvote it.
So the schema would look something like this:
Question
  -> Answers []
       -> Votes []
            -> value (-1/1)
            -> username
  -> Votes []
       -> value (-1/1)
       -> username
  -> question_text, etc.
Coming from a MySQL background this schema feels "icky" but I've been assured it's an industry practice.
Now I'm currently required to show each User what they have voted on, both in terms of Questions and Answers.
So if I have to find the Answer that a user has voted on, I would query thusly (in node):
question_collection.find(
  {'question.answers.votes.username': username},
  function (e, d) {
    /* do stuff */
  }
);
That query goes 4 levels deep. Is this a normal way to do it or should I introduce some normalization in the schema?

One of the strengths of MongoDB is that you can put all relevant information about a Question inside one document, so you need only 1 database query and no joins to get all information to render the Question.
However, if you want to find everything a user has voted on, things get a bit more complicated. You can do what you do, certainly, although it won't win any performance awards.
Alternatively, you can duplicate the data in a smart way to access it in an easier way. For example, you could add two arrays to the User model:
QuestionsVoted: [{ id1: +1}, {id2: -1}, {id3: +1}],
AnswersVoted: [{ id4: +1}, {id5: -1}]
This means you need to keep this data in sync: when a user votes on a question or answer, you need to update both the question and the user. This is not that bad, because the data is written rarely and read often.
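A minimal sketch of that dual write, using plain Python dicts as stand-ins for the Question and User documents. The `cast_question_vote` helper, the sample ids and the `votes` field layout are assumptions for illustration; with a real driver, each of the two writes would be an update on the corresponding collection.

```python
# In-memory stand-ins for two MongoDB collections (hypothetical data).
questions = {
    "q1": {"question_text": "How do I model votes?", "votes": [], "answers": []},
}
users = {
    "alice": {"QuestionsVoted": [], "AnswersVoted": []},
}


def cast_question_vote(username, question_id, value):
    """Record a vote on the question AND on the user, keeping both in sync."""
    if value not in (-1, 1):
        raise ValueError("value must be -1 or 1")
    question = questions[question_id]
    user = users[username]

    # Drop any earlier vote by this user so that a re-vote replaces it.
    question["votes"] = [v for v in question["votes"] if v["username"] != username]
    user["QuestionsVoted"] = [
        entry for entry in user["QuestionsVoted"] if question_id not in entry
    ]

    # Write 1: the vote embedded in the question document.
    question["votes"].append({"username": username, "value": value})
    # Write 2: the duplicated entry on the user document.
    user["QuestionsVoted"].append({question_id: value})


cast_question_vote("alice", "q1", 1)
cast_question_vote("alice", "q1", -1)  # changing the vote updates both places
```

Reading "everything alice voted on" is then a single lookup of her user document instead of a scan over every question.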
If you have other requirements that deal with votes themselves, for example statistics of votes over time, or by geographical region, you might want to create a Vote collection. And yes, you would have to keep the data in sync across 3 collections then.
MongoDB lets you store a reference to a document in another collection (the equivalent of a foreign key, although the database does not enforce it), but I would not recommend it in this case. You would lose some of the benefits of MongoDB, because it is not good at joins: resolving the references would require 2 separate queries.
You can read more about how to design relationships in MongoDB on their blog.

Related

Database choice for Multiple Questions and Answer

I'm about to develop a database for storing questionnaires with an indefinite number of questions. This database will communicate with a server, which registers the answers and retrieves them when needed. For each user, I need to store the answer to each question; storing them server-side is mandatory.
The structure should look like this:
QUESTIONNAIRE:
Questionnaire n.1 (Question1,Question2,...,QuestionN)
Questionnaire n.2 (Question1,...,QuestionM)
...
USER:
User1
User2
...
ANSWERS
User1 - Questionnaire n.1 (AnswToQuestion1, AnswToQuestion2...AnswToQuestionN)
User2 - Questionnaire n.1 (AnswToQuestion1, AnswToQuestion2...AnswToQuestionN)
User2 - Questionnaire n.2 (AnswToQuestion1, AnswToQuestion2...AnswToQuestionM)
Given the specifications above, I face 2 major problems:
An indefinite number of questions means that a table must store each individual question for each questionnaire
There should be a correspondence between question and answer
I thought about 2 different solutions to these problems, but I can't really decide between the two.
The first one would consist of creating 5 different tables, as follows:
The idea is basically to store users in USER table, and for each user store an ANSWER which refers to a specific questionnaire. This gives an information like 'user x has replied to questionnaire'. Then, the unique id id:ANSWER would be used to mark a given answer in MARKETINGANSWER. The QUESTIONNAIRE table would save each instance of a questionnaire, and provide an unique id to link each different question in the MARKETINGQUESTION table. At the end, the unique id id:MARKETINGQUESTION would be used to map a question to a given answer.
What does not convince me: this has too many layers of indirection. I do not like the forced connection of MARKETINGANSWER and MARKETINGQUESTION, yet I need a way to map them. Moreover, I need the general table for ANSWER, since it provides some general info about the questionnaire (the hidden columns).
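To make the first approach concrete, here is a rough sketch using Python's built-in sqlite3 module. The table names follow the description above, but every column (including `position`, `body` and `reply`) is an assumption, since the original diagram is not available.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE questionnaire (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE marketingquestion (
    id INTEGER PRIMARY KEY,
    questionnaire_id INTEGER NOT NULL REFERENCES questionnaire(id),
    position INTEGER NOT NULL,
    body TEXT NOT NULL
);
-- One row per (user, questionnaire) submission.
CREATE TABLE answer (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES user(id),
    questionnaire_id INTEGER NOT NULL REFERENCES questionnaire(id)
);
-- One row per answered question within a submission.
CREATE TABLE marketinganswer (
    answer_id INTEGER NOT NULL REFERENCES answer(id),
    question_id INTEGER NOT NULL REFERENCES marketingquestion(id),
    reply TEXT NOT NULL,
    PRIMARY KEY (answer_id, question_id)
);
""")

conn.execute("INSERT INTO user VALUES (1, 'User1')")
conn.execute("INSERT INTO questionnaire VALUES (1, 'Questionnaire n.1')")
conn.executemany(
    "INSERT INTO marketingquestion VALUES (?, 1, ?, ?)",
    [(1, 1, "Question1"), (2, 2, "Question2")],
)
conn.execute("INSERT INTO answer VALUES (1, 1, 1)")
conn.executemany(
    "INSERT INTO marketinganswer VALUES (1, ?, ?)",
    [(1, "AnswToQuestion1"), (2, "AnswToQuestion2")],
)

# Map every question of a questionnaire to the reply a given user gave.
rows = conn.execute(
    """
    SELECT q.body, ma.reply
    FROM answer a
    JOIN marketinganswer ma ON ma.answer_id = a.id
    JOIN marketingquestion q ON q.id = ma.question_id
    WHERE a.user_id = ? AND a.questionnaire_id = ?
    ORDER BY q.position
    """,
    (1, 1),
).fetchall()
```

The question-to-answer mapping lives in the composite key of `marketinganswer`, so the database itself rejects two replies to the same question within one submission.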
Then I thought about a second approach: creating a single ANSWER and a single QUESTIONNAIRE table, each of which would pack the series of answers/questions into a TEXT attribute, separated by a special escape sequence (such as #&, for instance). This would ruin the mapping, but the mapping can be implemented in the web application: I was thinking about a check on the number of questions/answers, with the ordering ensured when the questionnaire is registered and committed to the database. Moreover, the schema would be cleaner. Yet, I would lose a consistent representation in the database schema.
I would like, if possible, some suggestion about the proposed solutions. I am pretty sure there is something simple I cannot quite grasp.
Thank you in advance.

Is it efficient to merge similar data objects into a single table?

I need to store a lot of similar data for my question-and-answer system, such as votes, follows, bookmarks, etc.
Taking voting as an example, what is the best table layout for storing votes for questions, answers, and posts?
1. Store the votes separately, which gives 3 tables: UserQuestionVotes, UserAnswerVotes and UserPostVotes
2. Store the votes in one table:
UserVotes (id, user_id, item_id, item_type, vote),
where item_id and item_type are the id and type of the question, answer or post, and vote = -1/1
If I go the first way, I will have at least 9 tables.
If I go the second way, all the data sits in one heap, so I expect the table to get slower as it fills up.
Which way is more efficient in my case?
If you're looking for my opinion, I would pick door #1. Questions, Answers, and Posts are all separate, albeit related, "things." And, each of these "things" happen to also have "votes" associated with them ... but, really, a "vote" is not a "thing."
A "vote for a question" is tightly associated with "the question." "A vote for ..." anything else is the same. So now I start thinking about the queries I'm most likely to actually write. I'm most likely to want to write queries that, say, count how many votes a particular question has ... and I don't really want to muddy-up that query and make it either "hard to write" or obliged to look through a bunch of records that are not "votes for questions." The other types of votes wouldn't be relevant and I'd rather not have to filter them out. (If I need to write a query to count "how many votes for anything has this user cast?", I could very easily write that regardless.)
That's my opinion. (The database manager can take care of "efficiency" on its own. Design your database so that the queries you need to write are easy and clear to write.)
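As a rough illustration of both kinds of query under option 1, here is a sketch with Python's sqlite3 module. The table names come from the question; the columns and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE UserQuestionVotes (
    user_id INTEGER NOT NULL,
    question_id INTEGER NOT NULL,
    vote INTEGER NOT NULL CHECK (vote IN (-1, 1)),
    PRIMARY KEY (user_id, question_id)
);
CREATE TABLE UserAnswerVotes (
    user_id INTEGER NOT NULL,
    answer_id INTEGER NOT NULL,
    vote INTEGER NOT NULL CHECK (vote IN (-1, 1)),
    PRIMARY KEY (user_id, answer_id)
);
CREATE TABLE UserPostVotes (
    user_id INTEGER NOT NULL,
    post_id INTEGER NOT NULL,
    vote INTEGER NOT NULL CHECK (vote IN (-1, 1)),
    PRIMARY KEY (user_id, post_id)
);
INSERT INTO UserQuestionVotes VALUES (1, 10, 1), (2, 10, -1), (1, 11, 1);
INSERT INTO UserAnswerVotes VALUES (1, 20, 1);
INSERT INTO UserPostVotes VALUES (2, 30, 1);
""")

# The common query touches only rows that are votes for questions.
votes_on_question_10 = conn.execute(
    "SELECT COUNT(*) FROM UserQuestionVotes WHERE question_id = ?", (10,)
).fetchone()[0]

# The cross-type query is still short: stack the three tables with UNION ALL.
votes_cast_by_user_1 = conn.execute(
    """
    SELECT COUNT(*) FROM (
        SELECT user_id FROM UserQuestionVotes
        UNION ALL SELECT user_id FROM UserAnswerVotes
        UNION ALL SELECT user_id FROM UserPostVotes
    ) AS all_votes
    WHERE user_id = ?
    """,
    (1,),
).fetchone()[0]
```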

Sql table or Mongo document structure design for complex data structure

I have a requirement to build a question survey system.
Put simply, it needs questions, predefined answers and a record of users' answers.
A question needs a question id and question text.
An answer needs an answer id and answer text.
A user's answer record needs a record id, user id, question id, answer id, date, OS, IP, browser info and an "is live" flag.
I need to keep the full history of user records, which is why I need the "is live" column: only each user's latest answer is marked true. When a user answers the same question again, all of that user's existing answer records for it become history (is live = false).
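The "is live" rule can be sketched in a few lines of Python. This is only an illustration on an in-memory list: the `record_answer` helper and the field names are hypothetical, and it assumes the flag is tracked per user and per question.

```python
from datetime import datetime, timezone

records = []  # stand-in for the user answer record table


def record_answer(user_id, question_id, answer_id):
    """Append a new live record and demote the user's earlier records for that question."""
    for rec in records:
        if rec["user_id"] == user_id and rec["question_id"] == question_id:
            rec["is_live"] = False  # earlier answers become history
    records.append({
        "record_id": len(records) + 1,
        "user_id": user_id,
        "question_id": question_id,
        "answer_id": answer_id,
        "date": datetime.now(timezone.utc),
        "is_live": True,
    })


record_answer(1, 1, 1)
record_answer(1, 1, 2)  # same user, same question: the first record becomes history
record_answer(1, 2, 3)

# Only the latest answer per (user, question) is still live.
live = {(r["question_id"], r["answer_id"]) for r in records if r["is_live"]}
```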
It seems like a simple structure. But with more than 100,000 questions, more than 1 million users, and more than 20 answer records per user per question, that is more than 100,000 * 1,000,000 * 20 = 2,000,000,000,000 records, which becomes a big issue.
I also need to describe how the data will be used. I need to provide another system that uses these records to target a group of users by defining question/answer criteria. For example:
(Q1=A1 && Q2=A3 && Q3=A5 && (Q4=A8 || Q5=A9)) criteria 1
(Q1!=A1 && Q2=A3) criteria 2
(Q4=A8 || Q5!=A9) criteria 3
After I define the criteria:
I need to provide an API to get all user ids that match a given criterion (api1)
I need to provide an API to get all criteria that a given user matches (api2)
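To pin down what the two APIs compute, here is a small Python sketch that evaluates the three example criteria against each user's live answers. The sample users are made up, and it assumes an unanswered question counts as "not equal" to any answer. It scans every user, so it illustrates the semantics only, not a design that would scale.

```python
# Live answers per user: {user_id: {question_id: answer_id}} (hypothetical data).
live_answers = {
    "u1": {"Q1": "A1", "Q2": "A3", "Q3": "A5", "Q4": "A8"},
    "u2": {"Q1": "A2", "Q2": "A3", "Q5": "A9"},
    "u3": {"Q1": "A1", "Q2": "A4", "Q5": "A7"},
}

# Each criterion is a predicate over one user's live answers.
criteria = {
    "criteria 1": lambda a: (
        a.get("Q1") == "A1" and a.get("Q2") == "A3" and a.get("Q3") == "A5"
        and (a.get("Q4") == "A8" or a.get("Q5") == "A9")
    ),
    "criteria 2": lambda a: a.get("Q1") != "A1" and a.get("Q2") == "A3",
    "criteria 3": lambda a: a.get("Q4") == "A8" or a.get("Q5") != "A9",
}


def users_matching(criteria_name):
    """api1: all user ids whose live answers satisfy the criterion."""
    predicate = criteria[criteria_name]
    return sorted(u for u, answers in live_answers.items() if predicate(answers))


def criteria_for(user_id):
    """api2: all criteria that the user's live answers satisfy."""
    answers = live_answers[user_id]
    return sorted(name for name, predicate in criteria.items() if predicate(answers))
```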
The APIs need to be fast and live, and they will be called frequently.
So just imagine 2,000,000,000,000 records in one table: the API calls would be very slow or even kill the db.
I have some solutions that are not good; I list them here so we can discuss them:
1. Each question gets its own table holding all user records for that question.
2. Each user gets its own table holding all question records for that user.
3. Both 1 and 2
But I can see these solutions are not very good or efficient, so I want to discuss them here. The technology doesn't matter (SQL, NoSQL, Hadoop, etc.).
Please share your thoughts here.
I would try with mongoDB using only one "user" collection storing answers in arrays:
{
  userId: 1,
  name: "nick",
  ...,
  "answers": [
    {
      questionId: 1,
      answerId: 1,
      date: Date(...),
      ...,
      isLive: 1
    },
    {
      questionId: 1,
      answerId: 2,
      date: Date(...),
      ...,
      isLive: 0
    }
  ]
}
Then I would use a Multikey Index on the property "answers.isLive" to ensure high speed access to "live" answers.
Another multikey index on "answers.questionId" and "answers.answerId" should give good performance when retrieving data by your criteria.
With numbers like yours, I would consider sharding the collection from the start.

Implementing Comments and Likes in database

I'm a software developer. I love to code, but I hate databases... Currently, I'm creating a website on which a user will be allowed to mark an entity as liked (like in FB), tag it and comment.
I get stuck on database tables design for handling this functionality. Solution is trivial, if we can do this only for one type of thing (eg. photos). But I need to enable this for 5 different things (for now, but I also assume that this number can grow, as the whole service grows).
I found some similar questions here, but none of them have a satisfying answer, so I'm asking this question again.
The question is, how to properly, efficiently and elastically design the database, so that it can store comments for different tables, likes for different tables and tags for them. Some design pattern as answer will be best ;)
Detailed description:
I have a table User with some user data, and 3 more tables: Photo with photographs, Articles with articles, Places with places. I want to enable any logged user to:
comment on any of those 3 tables
mark any of them as liked
tag any of them with some tag
I also want to count the number of likes for every element and the number of times that particular tag was used.
1st approach:
a) For tags, I will create a table Tag [TagId, tagName, tagCounter], then I will create many-to-many relationships tables for: Photo_has_tags, Place_has_tag, Article_has_tag.
b) The same goes for comments.
c) I will create tables LikedPhotos [idUser, idPhoto], LikedArticles [idUser, idArticle], LikedPlace [idUser, idPlace]. The number of likes will be calculated by queries (which, I assume, is bad). And...
I really don't like this design for the last part; it smells bad to me ;)
2nd approach:
I will create a table ElementType [idType, TypeName == some table name] which will be populated by the administrator (me) with the names of tables that can be liked, commented or tagged. Then I will create tables:
a) LikedElement [idLike, idUser, idElementType, idLikedElement] and the same for Comments and Tags with the proper columns for each. Now, when I want to make a photo liked I will insert:
INSERT INTO LikedElement (idUser, idElementType, idLikedElement)
SELECT :userId, idType, :photoId FROM ElementType WHERE TypeName = 'Photo';
and for places:
INSERT INTO LikedElement (idUser, idElementType, idLikedElement)
SELECT :userId, idType, :placeId FROM ElementType WHERE TypeName = 'Place';
and so on... I think that the second approach is better, but I also feel like something is missing in this design as well...
Lastly, I also wonder where the best place is to store the counter of how many times an element was liked. I can think of only two ways:
in the element (Photo/Article/Place) table
by select count().
I hope that my explanation of the issue is more thorough now.
The most extensible solution is to have just one "base" table (connected to "likes", tags and comments), and "inherit" all other tables from it. Adding a new kind of entity involves just adding a new "inherited" table - it then automatically plugs into the whole like/tag/comment machinery.
Entity-relationship term for this is "category" (see the ERwin Methods Guide, section: "Subtype Relationships"). The category symbol is:
Assuming a user can like multiple entities, a same tag can be used for more than one entity but a comment is entity-specific, your model could look like this:
BTW, there are roughly 3 ways to implement the "ER category":
All types in one table.
All concrete types in separate tables.
All concrete and abstract types in separate tables.
Unless you have very stringent performance requirements, the third approach is probably the best (meaning the physical tables match 1:1 the entities in the diagram above).
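A minimal sketch of that third approach, using Python's sqlite3 module. The original diagram is not available, so the table and column names here (`entity`, `user_like`, `app_user` and so on) are assumptions; only likes are shown, but tags and comments would reference the base table in the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Abstract base type: everything that can be liked, tagged or commented on.
CREATE TABLE entity (entity_id INTEGER PRIMARY KEY);
-- Concrete types share the base table's key.
CREATE TABLE photo (
    entity_id INTEGER PRIMARY KEY REFERENCES entity(entity_id),
    url TEXT NOT NULL
);
CREATE TABLE article (
    entity_id INTEGER PRIMARY KEY REFERENCES entity(entity_id),
    title TEXT NOT NULL
);
CREATE TABLE app_user (user_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- Likes point at the base table only, so they work for every concrete type.
CREATE TABLE user_like (
    user_id INTEGER NOT NULL REFERENCES app_user(user_id),
    entity_id INTEGER NOT NULL REFERENCES entity(entity_id),
    PRIMARY KEY (user_id, entity_id)
);
""")


def add_photo(url):
    """Insert the base row first, then the photo row that shares its key."""
    entity_id = conn.execute("INSERT INTO entity DEFAULT VALUES").lastrowid
    conn.execute("INSERT INTO photo (entity_id, url) VALUES (?, ?)", (entity_id, url))
    return entity_id


def add_article(title):
    """Same pattern for another concrete type."""
    entity_id = conn.execute("INSERT INTO entity DEFAULT VALUES").lastrowid
    conn.execute("INSERT INTO article (entity_id, title) VALUES (?, ?)", (entity_id, title))
    return entity_id


def like_count(entity_id):
    """Count likes without knowing which concrete type the entity is."""
    return conn.execute(
        "SELECT COUNT(*) FROM user_like WHERE entity_id = ?", (entity_id,)
    ).fetchone()[0]


conn.executemany("INSERT INTO app_user VALUES (?, ?)", [(1, "ann"), (2, "ben")])
photo_id = add_photo("beach.jpg")
article_id = add_article("Schema design")
conn.executemany(
    "INSERT INTO user_like VALUES (?, ?)",
    [(1, photo_id), (2, photo_id), (1, article_id)],
)
```

Adding a Place type would mean one more table keyed on `entity_id`; the `user_like` table and `like_count` stay unchanged.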
Since you "hate" databases, why are you trying to implement one? Instead, solicit help from someone who loves and breathes this stuff.
Otherwise, learn to love your database. A well designed database simplifies programming, engineering the site, and smooths its continuing operation. Even an experienced d/b designer will not have complete and perfect foresight: some schema changes down the road will be needed as usage patterns emerge or requirements change.
If this is a one man project, program the database interface into simple operations using stored procedures: add_user, update_user, add_comment, add_like, upload_photo, list_comments, etc. Do not embed the schema into even one line of code. In this manner, the database schema can be changed without affecting any code: only the stored procedures should know about the schema.
You may have to refactor the schema several times. This is normal. Don't worry about getting it perfect the first time. Just make it functional enough to prototype an initial design. If you have the luxury of time, use it some, and then delete the schema and do it again. It is always better the second time.
This is the general idea.
Please don't pay much attention to the styling of the field names, but rather to the relations and structure.
This pseudocode will get all the comments on the photo with ID 5:
SELECT * FROM actions
WHERE actions.id_Stuff = 5
  AND actions.typeStuff = 'photo'
  AND actions.typeAction = 'comment'
This pseudocode will get all the likes on (or the users who liked) the photo with ID 5
(you may use count() to get just the number of likes):
SELECT * FROM actions
WHERE actions.id_Stuff = 5
  AND actions.typeStuff = 'photo'
  AND actions.typeAction = 'like'
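For a runnable version of the same idea, here is a sketch with Python's sqlite3 module. Only `id_Stuff`, `typeStuff` and `typeAction` come from the queries above; the `id`, `id_User` and `content` columns and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE actions (
    id INTEGER PRIMARY KEY,
    id_User INTEGER NOT NULL,
    id_Stuff INTEGER NOT NULL,
    typeStuff TEXT NOT NULL,   -- 'photo', 'article', ...
    typeAction TEXT NOT NULL,  -- 'comment', 'like', ...
    content TEXT               -- comment text; NULL for likes
)
""")
conn.executemany(
    "INSERT INTO actions (id_User, id_Stuff, typeStuff, typeAction, content) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (1, 5, "photo", "comment", "Nice shot"),
        (2, 5, "photo", "like", None),
        (3, 5, "photo", "like", None),
        (1, 5, "article", "like", None),   # same id, different kind of thing
        (2, 6, "photo", "comment", "Blurry"),
    ],
)

# All comments on photo 5.
comments = conn.execute(
    "SELECT content FROM actions "
    "WHERE id_Stuff = ? AND typeStuff = ? AND typeAction = 'comment' ORDER BY id",
    (5, "photo"),
).fetchall()

# Number of likes on photo 5.
like_count = conn.execute(
    "SELECT COUNT(*) FROM actions "
    "WHERE id_Stuff = ? AND typeStuff = ? AND typeAction = 'like'",
    (5, "photo"),
).fetchone()[0]
```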
As far as I understand, several tables are required, with a many-to-many relation between them:
A table that stores user data such as name, surname and birth date, with an identity field.
A table that stores the data types. These types may be photos, shares, links. Each type must have its own table, so there is a relation between those individual tables and this table.
Each data type has its own table, for example status updates, photos, links.
The last table handles the many-to-many relation, storing an id, user id, data type and data id.
Look at the access patterns you are going to need. Do any of them seem to be made particularly difficult or inefficient by one design choice or the other?
If not, favour the one that requires fewer tables.
In this case:
Add Comment: you either pick a particular many/many table or insert into a common table with a known specific identifier for what is being liked, I think client code will be slightly simpler in your second case.
Find comments for item: here it seems using a common table is slightly easier - we just have a single query parameterised by type of entity
Find comments by a person about one kind of thing: simple query in either case
Find all comments by a person about all things: this seems little gnarly either way.
I think your "discriminated" approach, option 2, yields simpler queries in some cases and doesn't seem much worse in the others so I'd go with it.
Consider using a table per entity for comments etc. More tables mean better sharding and scaling. Managing many similar tables is not a problem in any framework I know.
One day you'll need to optimize reads from such a structure. You can easily create aggregating tables over the base ones and lose a bit on writes.
One big table with dictionary may become uncontrollable one day.
Definitely go with the second approach, where you have one table and store the element type for each row; it will give you a lot more flexibility. Basically, when something can logically be done with fewer tables, it is almost always better to go with fewer tables. Two advantages come to mind for your particular case. If you want to delete all liked elements of a certain user, the first approach needs one query per element type, while the second needs only one query. And if you want to add a new element type, the first approach involves creating a new table for each new type, while the second needs no schema change at all.

Database schema for ACL

I want to create a schema for an ACL; however, I'm torn between a couple of ways of implementing it.
I am pretty sure I don't want to deal with cascading permissions as that leads to a lot of confusion on the backend and for site administrators.
I think I can also live with users only being in one role at a time. A setup like this will allow roles and permissions to be added as needed as the site grows without affecting existing roles/rules.
At first I was going to normalize the data and have three tables to represent the relations.
ROLES { id, name }
RESOURCES { id, name }
PERMISSIONS { id, role_id, resource_id }
A query to figure out whether a user was allowed somewhere would look like this:
SELECT id FROM resources WHERE name = ?
SELECT * FROM permissions WHERE role_id = ? AND resource_id = ? ($user_role_id, $resource->id)
Then I realized that I will only have about 20 resources, each with up to 5 actions (create, update, view, etc..) and perhaps another 8 roles. This means that I can exercise blatant disregard for data normalization as I will never have more than a couple of hundred possible records.
So perhaps a schema like this would make more sense.
ROLES { id, name }
PERMISSIONS { id, role_id, resource_name }
which would allow me to lookup records in a single query
SELECT * FROM permissions WHERE role_id = ? AND resource_name = ? ($user_role_id, 'post.update')
So which of these is more correct? Are there other schema layouts for ACL?
In my experience, the real question mostly breaks down to whether or not any amount of user-specific access-restriction is going to occur.
Suppose, for instance, that you're designing the schema of a community and that you allow users to toggle the visibility of their profile.
One option is to stick to a public/private profile flag and stick to broad, pre-emptive permission checks: 'users.view' (views public users) vs, say, 'users.view_all' (views all users, for moderators).
Another involves more refined permissions, you might want them to be able to configure things so they can make themselves (a) viewable by all, (b) viewable by their hand-picked buddies, (c) kept private entirely, and perhaps (d) viewable by all except their hand-picked bozos. In this case you need to store owner/access-related data for individual rows, and you'll need to heavily abstract some of these things in order to avoid materializing the transitive closure of a dense, oriented graph.
With either approach, I've found that the added complexity in role editing/assignment is offset by the resulting ease/flexibility in assigning permissions to individual pieces of data, and that the following worked best:
Users can have multiple roles
Roles and permissions merged in the same table with a flag to distinguish the two (useful when editing roles/perms)
Roles can assign other roles, and roles and perms can assign permissions (but permissions cannot assign roles), from within the same table.
The resulting oriented graph can then be pulled in two queries, built once and for all in a reasonable amount of time using whichever language you're using, and cached into Memcache or similar for subsequent use.
From there, pulling a user's permissions is a matter of checking which roles he has, and processing them using the permission graph to get the final permissions. Check permissions by verifying that a user has the specified role/permission or not. And then run your query/issue an error based on that permission check.
You can extend the check for individual nodes (i.e. check_perms($user, 'users.edit', $node) for "can edit this node" vs check_perms($user, 'users.edit') for "may edit a node") if you need to, and you'll have something very flexible/easy to use for end users.
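The role and permission graph described above can be sketched in Python as follows. The role names, permission names and the `effective_grants` helper are hypothetical; in practice the `grants` mapping would be built from the merged roles/permissions table and cached, and the per-node variant of `check_perms` is not covered here.

```python
# Edges of the grant graph: each node lists the roles and permissions it assigns.
# All names are made up for illustration.
grants = {
    "admin": ["moderator", "users.delete"],
    "moderator": ["member", "users.view_all"],
    "member": ["users.view"],
    "users.view_all": ["users.view"],  # a permission can assign permissions
}
user_roles = {"alice": ["moderator"], "bob": ["member"]}


def effective_grants(user):
    """Walk the graph from the user's roles and collect everything reachable."""
    seen = set()
    stack = list(user_roles.get(user, []))
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already expanded; this also guards against cycles
        seen.add(node)
        stack.extend(grants.get(node, []))
    return seen


def check_perms(user, permission):
    """True if the user holds the role or permission, directly or transitively."""
    return permission in effective_grants(user)
```

For example, `check_perms("alice", "users.view")` holds through two hops (moderator, then member), while `check_perms("bob", "users.view_all")` does not.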
As the opening example should illustrate, be wary of steering too much towards row-level permissions. The performance bottleneck is less in checking an individual node's permissions than it is in pulling a list of valid nodes (i.e. only those that the user can view or edit). I'd advise against anything beyond flags and user_id fields within the rows themselves if you're not (very) well versed in query optimization.
"This means that I can exercise blatant disregard for data normalization as I will never have more than a couple hundred possible records."
The number of rows you expect isn't a criterion for choosing which normal form to aim for.
Normalization is concerned with data integrity. It generally increases data integrity by reducing redundancy.
The real question to ask isn't "How many rows will I have?", but "How important is it for the database to always give me the right answers?" For a database that will be used to implement an ACL, I'd say "Pretty danged important."
If anything, a low number of rows suggests you don't need to be concerned with performance, so 5NF should be an easy choice to make. You'll want to hit 5NF before you add any id numbers.
"A query to figure out if a user was allowed somewhere would look like this:"
SELECT id FROM resources WHERE name = ?
SELECT * FROM permissions WHERE role_id = ? AND resource_id = ? ($user_role_id, $resource->id)
That you wrote that as two queries instead of using an inner join suggests that you might be in over your head. (That's an observation, not a criticism.)
SELECT p.*
FROM permissions p
INNER JOIN resources r ON r.id = p.resource_id
WHERE p.role_id = ? AND r.name = ?
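As a quick check that a single joined query answers the question, here is a sketch with Python's sqlite3 module. The three tables follow the normalized schema from the question; the sample roles, resources and the `is_allowed` helper are assumptions, and the role filter from the question's second query is included in the WHERE clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE roles (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE resources (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE permissions (
    id INTEGER PRIMARY KEY,
    role_id INTEGER NOT NULL REFERENCES roles(id),
    resource_id INTEGER NOT NULL REFERENCES resources(id),
    UNIQUE (role_id, resource_id)
);
INSERT INTO roles VALUES (1, 'editor'), (2, 'guest');
INSERT INTO resources VALUES (1, 'post.update'), (2, 'post.view');
INSERT INTO permissions (role_id, resource_id) VALUES (1, 1), (1, 2), (2, 2);
""")


def is_allowed(role_id, resource_name):
    """One round trip: join permissions to resources and filter on both."""
    row = conn.execute(
        """
        SELECT 1
        FROM permissions p
        INNER JOIN resources r ON r.id = p.resource_id
        WHERE p.role_id = ? AND r.name = ?
        """,
        (role_id, resource_name),
    ).fetchone()
    return row is not None
```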
You can use a SET column (a MySQL type) to hold the allowed actions.
CREATE TABLE permission (
  id INTEGER PRIMARY KEY AUTO_INCREMENT,
  name VARCHAR(255),
  perm SET('create', 'edit', 'delete', 'view'),
  resource_id INTEGER
);