SQL or NoSQL search? - mysql

Let us suppose I have a site with a certain number of users with the following three distinguishing characteristics:
1) The user is part of a network. (The site contains multiple networks.)
2) The user is a 'contact' of a certain number of other site members.
3) Individual documents uploaded by a user may be shared with certain contacts (excluding other contacts).
In this way, a user's document search is unique for each user based upon his or her network, contacts, and additional documents that have been shared with that user. What would be possible ways to address this -- would I need to append a long unique SQL query for each user for each of his or her searches? I am currently using MySQL as a database -- would using this be sufficient, or would I need to move towards a NoSQL option here to maintain the performance of a similar non-filtered search?

A few questions come to mind to help answer this question:
How many documents do you think the average user will have access to? Will many documents in the network be shared for all to see?
How will users be able to find documents and what do the documents look like? Will they only be able to search by the contact that shared it? By a simple title match? Will they be able to run a full text search against the document's contents?
Depending on the answers to those two questions, a relational system could work just fine, which I'm guessing is preferable since you are already using MySQL. I think you could locate the documents for an individual user in a relational system with a few very reasonable queries.
Here is a potential bare-bones schema:
User
--all users in the system
UserId int
NetworkId int (Not sure if this is a 1 to many relationship)
Document
--all documents in the system
DocumentId int
UserId int -- the author
Name varchar
StatusId -- perhaps a flag to indicate whether it is public or not, e.g. shared with everyone in the same network or shared with all contacts
UserDocumentLink
--Linking between a document and the contacts a user has shared the document with
DocumentId
ContactId
UserContact
--A link between a user and all of their contacts
ContactId -- PK identity to represent a link between two users
UserId -- User who owns the contact
ContactUserId --The contact user
Here is a potential "search" query:
-- documents owned by me
SELECT DocumentId
FROM Document
WHERE UserId = @userId
UNION
-- documents shared with me explicitly
SELECT udl.DocumentId
FROM UserContact uc
INNER JOIN UserDocumentLink udl ON uc.ContactId = udl.ContactId
WHERE uc.ContactUserId = @userId
UNION
-- documents shared with me via some public status, using a keyword filter
SELECT d.DocumentId
FROM Document d
INNER JOIN User u ON d.UserId = u.UserId
WHERE u.NetworkId = @userNetworkId
AND d.StatusId IN () -- fill in the status ids that count as public
AND d.Name LIKE CONCAT('%', @keyword, '%')
I think a more influential requirement for schema design might be one that is not mentioned in your question: how will users be able to search through documents? And what kind of documents are we talking about here? MySQL is not a good option for full text search.
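To make the three-part UNION concrete, here is a minimal sketch that runs the same idea end to end. It uses Python's built-in sqlite3 module as a stand-in for MySQL, so the string concatenation is written with `||` instead of CONCAT. The sample rows, the meaning of StatusId 1 ("shared with the whole network") and the function name visible_documents are assumptions made up for the illustration, not part of the schema above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE User (UserId INTEGER PRIMARY KEY, NetworkId INTEGER);
CREATE TABLE Document (DocumentId INTEGER PRIMARY KEY, UserId INTEGER,
                       Name TEXT, StatusId INTEGER);
CREATE TABLE UserContact (ContactId INTEGER PRIMARY KEY, UserId INTEGER,
                          ContactUserId INTEGER);
CREATE TABLE UserDocumentLink (DocumentId INTEGER, ContactId INTEGER);

INSERT INTO User VALUES (1, 10), (2, 10), (3, 20);
-- Assumption: StatusId 1 = shared with the whole network, 0 = private
INSERT INTO Document VALUES
    (100, 1, 'my notes', 0),
    (200, 2, 'budget draft', 0),
    (201, 2, 'network handbook', 1),
    (300, 3, 'other network handbook', 1);
-- User 2 has user 1 as a contact and shares document 200 with them
INSERT INTO UserContact VALUES (1, 2, 1);
INSERT INTO UserDocumentLink VALUES (200, 1);
""")

# One parameterised query serves every user; nothing is generated per user.
VISIBLE_DOCS = """
SELECT DocumentId FROM Document WHERE UserId = :user_id
UNION
SELECT udl.DocumentId
FROM UserContact uc
JOIN UserDocumentLink udl ON uc.ContactId = udl.ContactId
WHERE uc.ContactUserId = :user_id
UNION
SELECT d.DocumentId
FROM Document d
JOIN User u ON d.UserId = u.UserId
WHERE u.NetworkId = :network_id
  AND d.StatusId = 1
  AND d.Name LIKE '%' || :keyword || '%'
"""


def visible_documents(user_id, network_id, keyword):
    """Return the sorted ids of the documents this user may see."""
    params = {"user_id": user_id, "network_id": network_id, "keyword": keyword}
    return sorted(row[0] for row in conn.execute(VISIBLE_DOCS, params))


print(visible_documents(1, 10, "handbook"))  # [100, 200, 201]
```

User 1 gets their own document, the one shared explicitly, and the public handbook of their own network, but not the handbook of network 20.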

It rather depends on what you mean by a "certain number" of users. If you mean a few tens of thousands, then almost any solution can be made to perform adequately. If you mean many millions, then a NoSQL solution may scale up more cheaply and easily.
I suspect that a more general SQL query can be used, rather than a unique one for each user, e.g. selecting documents that belong to users that know the current user, that are marked as being shared with the current user, and match the search string.
Denormalisation can probably be used (as is common in NoSQL approaches) to improve performance.
However, a graph database (as Peter Neubauer suggests) possibly in combination with a document store (CouchDB, MongoDB or Cassandra) would work very well for this type of problem and would scale well.

I would take a look at some of the NoSQL solutions. For this interconnected dataset, possibly Neo4j, a graph database. It's even pretty straightforward to query it through Cypher so that you get tabular results back.

As others have pointed out the number of users and the frequency of requests (traffic volume) must be looked at. Also, how important is redundancy? How likely are people to work on same documents simultaneously? Are most documents created once and distributed for "readonly" purposes?
NoSQL can help you scale and get redundancy much more easily than an RDBMS for this particular scenario. I am assuming that at some point you will want tagging etc. to be enabled on the documents.
Now, I am wondering if there is any particular reason why you are not looking at an off-the-shelf document management or CMS system for this? I am sure there is a good reason, but it might be worth looking at all of those options too.
I hope this helps. Good luck!

Denormalization will give you better read/search performance in this case.
Don't normalize users; keep frequently joined entities, like owner and text, in one table.
E.g. store the owners' names on the text table next to the owner FK. That decreases the number of joins, and you can then use SQL freely.

I've managed this using long unique queries in MySQL, as you suggest, for a small-scale social networking project. Nowadays I would suggest using Solr and keeping permission information as a denormalized array of interchangeable keywords on each document. Say each network has a unique recognizable code (e.g. 100N-20000N), and similarly for users and special permission grants. You can store an array of permission keys, like "5515N 43243N 2342N 603U 203PG 44321PG", and treat those as keywords when searching.
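The permission-key idea can be sketched without any search engine at all: a document is visible when at least one of the searcher's keys appears among the document's keys. The snippet below is plain Python and only illustrates the matching rule; the function names, the sample documents and the way keys are built are assumptions, and a real setup would hand the user's keys to the search engine as a filter instead.

```python
def permission_keys(user_id, network_id, grant_ids=()):
    """Build the set of keys a user may search with (N = network, U = user, PG = grant)."""
    keys = {f"{network_id}N", f"{user_id}U"}
    keys.update(f"{grant}PG" for grant in grant_ids)
    return keys


def can_see(document_keys, user_keys):
    """A document is visible if it shares at least one key with the user."""
    return not user_keys.isdisjoint(document_keys.split())


# Hypothetical documents, each carrying its denormalized permission keys
documents = {
    "report": "5515N 43243N 2342N 603U 203PG 44321PG",
    "memo": "77N 9U",
}

searcher = permission_keys(user_id=603, network_id=1)
print([name for name, keys in documents.items() if can_see(keys, searcher)])
# ['report']
```

The appeal of this design is that the permission check becomes an ordinary keyword match, so it costs about the same as the text search itself. The price is that the key list has to be rewritten whenever sharing changes.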

I would address it with a simple business-process solution, which leads to a simple data schema, a simple query, and therefore good performance and scalability:
Each user has a list of documents... Period.
This list is in fact a list of references to documents in a document table (with owner/security information...).
When a document is shared with another user, the document reference is added to that user's document list (tagged as a shared one if you want), and the user is added to the document's security list (with a permission level, for example).
The SQL query to get the documents is a simple: select documentid from userdocument where userid=#userid
With a join on the document table, proper indexes and SQL tuning, it will return all the needed information and it will run fast.
I hope I understood correctly what you are trying to do.
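Here is a rough sketch of that "each user has a list of documents" approach, again with Python's sqlite3 module standing in for MySQL. Only the userdocument table and its userid/documentid columns come from the answer; the document columns, the shared flag and the helper names are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE document (documentid INTEGER PRIMARY KEY, ownerid INTEGER, name TEXT);
CREATE TABLE userdocument (
    userid INTEGER,
    documentid INTEGER,
    shared INTEGER DEFAULT 0,          -- 1 when the entry came from a share
    PRIMARY KEY (userid, documentid)
);
""")


def add_document(documentid, ownerid, name):
    """Create a document and put it on the owner's own list."""
    conn.execute("INSERT INTO document VALUES (?, ?, ?)", (documentid, ownerid, name))
    conn.execute("INSERT INTO userdocument VALUES (?, ?, 0)", (ownerid, documentid))


def share(documentid, userid):
    """Sharing just adds a reference to the other user's list."""
    conn.execute("INSERT OR IGNORE INTO userdocument VALUES (?, ?, 1)", (userid, documentid))


def documents_for(userid):
    """The whole 'search scope' of a user is one indexed lookup plus a join."""
    rows = conn.execute(
        "SELECT d.documentid, d.name, ud.shared "
        "FROM userdocument ud JOIN document d ON d.documentid = ud.documentid "
        "WHERE ud.userid = ? ORDER BY d.documentid",
        (userid,),
    )
    return rows.fetchall()


add_document(1, 10, "plan")
add_document(2, 11, "notes")
share(2, 10)
print(documents_for(10))  # [(1, 'plan', 0), (2, 'notes', 1)]
```

The work moves from read time to share time: the query stays trivial, but every share and unshare has to keep the list in sync.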

-< = one to many
>-< = many to many (will require link table)
Network -< user -< documents >-< contact(user)
v
|
^
contacts(user,user)
This is relational, I don't see a good reason to go NoSQL unless you have a billion users
Network (unless you can belong to more than one) is an attribute of user.
Contacts will be maintained in the link table contacts(user_id, c_user_id).
Tables:
documents(doc_id, user_id)
users(user_id)
contacts(user_id, c_user_id) with foreign keys on users
document_contact(doc_id, c_user_id) where a trigger constrains the c_user_id
Then you get a view of all doc owners and subscribers (contacts):
CREATE OR REPLACE VIEW user_docs AS
SELECT d.user_id, d.doc_id, 'owner' AS role
FROM documents d
JOIN users u ON d.user_id = u.user_id
UNION
SELECT c.user_id, d.doc_id, 'subscriber' AS role
FROM documents d
JOIN contacts c ON d.user_id = c.c_user_id;
You can then filter the view against the document contacts:
SELECT *
FROM user_docs ud
WHERE ud.user_id = 'me'
AND (ud.role = 'owner'
OR ud.doc_id IN (SELECT dc.doc_id
FROM document_contact dc
WHERE dc.c_user_id = ud.user_id))

I would trade off immediacy for performance when it comes to full text searching.
I would build a hash table of the user combinations with the documents on a separate thread, usually triggered by an asynchronous call when user associations change.
I would then query the hash value plus the other search criteria. This eliminates the need for the long SQL appended at the end, which may cause a lock.

Related

sql query to check many interests are matched

So I am building a swingers site. The users can search other users by their interests. This is only part of a number of parameters used to search a user. The thing is there are like 100 different interests. When searching another user they can select all the interests the user must share. While I can think of ways to do this, I know it is important the search be as efficient as possible.
The backend uses JDBC to connect to a MySQL database. Java is the backend programming language.
I have debated using multiple columns for interests, but the thing is that the generated SQL query need not check them all if those columns are not addressed in the JSON object sent to the server describing the search criteria. Also, I worry I may have to make painful modifications to the table at a later point if I add new columns.
Another thing I thought about was having some kind of long byte array, or number (used like a byte array) stored in a single column. I could & this with another number corresponding to the interests the user is searching for but I read somewhere this is actually quite inefficient despite it making good sense to my mind :/
And all of this has to be part of one big sql query with multiple tables joined into it.
One of the issues with using multiple columns would be the computing power used to run statement.setBoolean on what could be 40 columns.
I thought about generating an xml string in the client then processing that in the sql query.
Any suggestions?
I think the correct term is a Bitmask. I could maybe have one table for the bitmask that maps the users id to the bitmask for querying users interests, and another with multiple entries for each interest per user id for looking up which user has which interests efficiently if I later require this?
Basically, it would be great to have a separate table with all the interests, 2 columns: id and interest.
Then, have a table that links the user to the interests: user_interests which would have the following columns: id,user_id,interest_id. Here some knowledge about many-to-many relations would help a lot.
Hope it helps!
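With the user_interests link table, "users who share all of the selected interests" is usually written as a GROUP BY with a HAVING COUNT. The sketch below shows it with Python's sqlite3 module in place of JDBC/MySQL; the SQL itself should carry over largely unchanged. Note one assumption: the link table here uses a composite primary key (user_id, interest_id) instead of a separate id column, so that COUNT(*) cannot be inflated by duplicate rows. The sample data and the function name are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE interests (id INTEGER PRIMARY KEY, interest TEXT);
CREATE TABLE user_interests (
    user_id INTEGER,
    interest_id INTEGER REFERENCES interests(id),
    PRIMARY KEY (user_id, interest_id)
);
INSERT INTO interests VALUES (1, 'hiking'), (2, 'chess'), (3, 'cooking');
INSERT INTO user_interests VALUES (1, 1), (1, 2), (2, 1), (3, 1), (3, 2), (3, 3);
""")


def users_with_all(interest_ids):
    """Return the users that have every one of the requested interests."""
    ids = sorted(set(interest_ids))
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))
    sql = (
        "SELECT user_id FROM user_interests "
        f"WHERE interest_id IN ({placeholders}) "
        "GROUP BY user_id "
        "HAVING COUNT(*) = ? "          # matched as many rows as interests requested
        "ORDER BY user_id"
    )
    return [row[0] for row in conn.execute(sql, ids + [len(ids)])]


print(users_with_all([1, 2]))     # [1, 3]
print(users_with_all([1, 2, 3]))  # [3]
```

Only the number of placeholders changes with the request, so adding a new interest is an INSERT into interests, not an ALTER TABLE.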

Database Design: User Profiles like in Meetup.com

In Meetup.com, when you join a meetup group, you are usually required to complete a profile for that particular group. For example, if you join a movie meetup group, you may need to list the genres of movies you enjoy, etc.
I'm building a similar application, wherein users can join various groups and complete different profile details for each group. Assume the 2 possibilities:
Users can create their own groups and define what details to ask users that join that group (so, something a bit dynamic -- perhaps suggesting that at least an EAV design is required)
The developer decides now which groups to create and specify what details to ask users who join that group (meaning that the profile details will be predefined and "hard coded" into the system)
What's the best way to model such data?
More elaborate example:
The "Movie Goers" group requests its members to specify the following:
Name
Birthdate (to be used to compute member's age)
Gender (must select from "male" or "female")
Favorite Genres (must select 1 or more from a list of specified genres)
The "Extreme Sports" group requests its members to specify the following:
Name
Description of Activities Enjoyed (narrative form)
Postal Code
The bottom line is that each group may require different details from members joining their group. Ideally, I would like anyone to create a group (ala MeetUp.com). However, I also need the ability to query for members fairly well (e.g. find all women movie goers between the ages of 25 and 30).
For something like this....you'd want maximum normalization, so you wouldn't have duplicate data anywhere. Because your user-defined tables could possibly contain the same type of record, I think that you might have to go above 3NF for this.
My suggestion would be this - explode your tables so that you have something close to 6NF with EAV, so that each question that users must answer will have its own table. Then, your user-created tables will all reference one of your question tables. This avoids the duplication of data issue. (For instance, you don't want an entry in the "MovieGoers" group with the name "John Brown" and one in the "Extreme Sports" group with the name "Johnny B." for the same user; you also don't want his "what is your favorite color" answer to be "Blue" in one group and "Red" in another. Any data that can span across groups, like common questions, would be normalized in this form.)
The main drawback to this is that you'd end up with a lot of tables, and you'd probably want to create views for your statistical queries. However, in terms of pure data integrity, this would work well.
Note that you could probably get away with only factoring out the common fields, if you really wanted to. Examples of common fields would include Name, Location, Gender, and others; you could also do the same for common questions, like "what is your favorite color" or "do you have pets" or something to that extent. Group-specific questions that don't span across groups could be stored in a separate table for that group, un-exploded. I wouldn't advise this because it wouldn't be as flexible as the pure 6NF option and you run the risk of duplication (how do you predetermine which questions won't be common questions?) but if you really wanted to, you could do this.
There's a good question about 6NF here: Would like to Understand 6NF with an Example
I hope that made some sense and I hope it helps. If you have any questions, leave a comment.
Really, this is exactly a problem for which SQL is not the right solution. Forget normalization. This is exactly the job for NoSQL document stores. Every user is a document with some essential fields like id, name, pwd etc., and every group adds the possibility of adding some fields. Unique fields can have names prefixed with the group id; shared fields (that capture some more general concept) can leave the field name unprefixed.
Besides users (and groups), you will also have field descriptions with name, type, possible values, ... which is also a very good fit for a document store.
If you use a key-value document store from the beginning, you gain this freeform way of structuring your data, plus querying it (though not with SQL, but by whatever means this or that NoSQL database provides).
First I'd like to note that the following structure is just a basis for your DB and you will need to expand/reduce it.
There are the following entities in DB:
user (just user)
group (any group)
template (list of requirement united into template to simplify assignment)
requirement (single requirement. For example: date of birth, gender, favorite sport)
"Modeling":
**User**
user_id
user_name
**Group**
name
group_id
**user_group**
user_id (FK)
group_id (FK)
**requirement**:
requirement_id
requirement_name
requirement_type (FK) (the type: combo, free string, date; should refer to a dictionary table)
**template**
template_id
template_name
**template_requirement**
r_id (FK)
t_id (FK)
The next step is to model an appropriate schema for storing restrictions, i.e. the validation rule for any requirement in any template. We have to separate it because the same requirement can carry different restrictions in different groups (for example: "age"). You can use the following table:
**restrictions**
group_id
template_id
requirement_id (should be here along with template_id because the same requirement can exist in different templates and any group can consist of many templates)
restriction_type (FK) (points to another dictionary: value, length, regexp, at_least_one_value_chosen and so on)
So, as I said, it is the basis. Feel free to simplify this schema (wipe out tables, drop multiple templates per group). Or you can make it more general by adding the ability to create and publish templates, requirements and so on.
Hope you find this idea useful
You could save such data as JSON or XML (Structure, Data)
User Table
Userid
Username
Password
Groups -> JSON Array of all Groups
GroupStructure Table
Groupid
Groupname
Groupstructure -> JSON Structure (with specified Fields)
GroupData Table
Userid
Groupid
Groupdata -> JSON Data
I think this covers most of your constraints:
users
user_id, user_name, password, birth_date, gender
1, Robert Jones, *****, 2011-11-11, M
group
group_id, group_name
1, Movie Goers
2, Extreme Sports
group_membership
user_id, group_id
1, 1
1, 2
group_data
group_data_id, group_id, group_data_name
1, 1, Favorite Genres
2, 2, Favorite Activities
group_data_value
id, group_data_id, group_data_value
1,1,Comedy
2,1,Sci-Fi
3,1,Documentaries
4,2,Extreme Cage Fighting
5,2,Naked Extreme Bike Riding
user_group_data
user_id, group_id, group_data_id, group_data_value_id
1,1,1,1
1,1,1,2
1,2,2,4
1,2,2,5
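As a rough check that this layout can answer the question's example ("find all women movie goers between the ages of 25 and 30"), here is a minimal sketch using Python's sqlite3 module as a stand-in for MySQL. The table and column names follow the listing above (password omitted); the sample people, the 'F'/'M' codes and the find_members helper are assumptions. The reference date is passed in explicitly so the result does not depend on when the code is run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, user_name TEXT,
                    birth_date TEXT, gender TEXT);
-- "group" is a reserved word, so it has to be quoted
CREATE TABLE "group" (group_id INTEGER PRIMARY KEY, group_name TEXT);
CREATE TABLE group_membership (user_id INTEGER, group_id INTEGER);

-- Made-up sample data
INSERT INTO users VALUES
    (1, 'Ann',  '1996-03-15', 'F'),
    (2, 'Bea',  '1990-01-01', 'F'),
    (3, 'Carl', '1996-03-15', 'M'),
    (4, 'Dee',  '1997-07-07', 'F');
INSERT INTO "group" VALUES (1, 'Movie Goers'), (2, 'Extreme Sports');
INSERT INTO group_membership VALUES (1, 1), (2, 1), (3, 1), (4, 2);
""")


def find_members(group_name, gender, min_age, max_age, today):
    """Members of a group with the given gender whose age on `today` is in range."""
    sql = """
    SELECT u.user_name
    FROM users u
    JOIN group_membership gm ON gm.user_id = u.user_id
    JOIN "group" g ON g.group_id = gm.group_id
    WHERE g.group_name = :group_name
      AND u.gender = :gender
      AND u.birth_date <= date(:today, :youngest)  -- at least min_age
      AND u.birth_date >  date(:today, :oldest)    -- not yet max_age + 1
    ORDER BY u.user_name
    """
    params = {
        "group_name": group_name,
        "gender": gender,
        "today": today,
        "youngest": f"-{min_age} years",
        "oldest": f"-{max_age + 1} years",
    }
    return [row[0] for row in conn.execute(sql, params)]


print(find_members("Movie Goers", "F", 25, 30, "2024-06-01"))  # ['Ann']
```

Filtering on a group-specific answer such as a favorite genre would add joins through user_group_data and group_data_value, which is where this kind of layout starts to get heavier.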
I've had similar issues to this. I'm not sure if this would be the best recommendation for your specific situation but consider this.
Provide a means of storing data as XML, or JSON, or some other format that delimits the data, but basically stores it in field that has no specific format.
Provide a way to store the definition of that data
Provide a lookup/index table for the data.
This is a combination of techniques indicated already.
Essentially, you would create some interface for your clients to create a "form" for what they want saved. This form would indicate what pieces of information they want from the user. It would also indicate what pieces of information you want to search on.
Save this information to the definition table.
The definition table is then used to describe the user interface for entering data.
Once user data is entered, save the data (as xml or whatever) to one table with a unique id. At the same time, another table will be populated as an index with
id where the xml data was saved
name of field data is stored in
value of field data stored.
id of data definition.
Now when a search commences, there should be no issue in searching the index table by name, value and definition id, and getting back the id of the XML/JSON (or whatever) data stored in the table where the form data was saved.
That data should be transformable once it is retrieved.
I was seriously sketchy on the details here, I hope this is enough of an answer to get you started. If you would like any explanation or additional details, let me know and I'll be happy to help.
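Since the description above is admittedly light on detail, here is one way the "opaque blob plus index table" idea could look. It is a sketch under assumptions: JSON is used as the storage format, Python's sqlite3 module stands in for the real database, and the table names (form_data, form_index), column names and helper functions are all invented for the example.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The submitted form, stored whole in a field with no specific format
CREATE TABLE form_data (data_id INTEGER PRIMARY KEY, definition_id INTEGER, payload TEXT);
-- One row per searchable field of each saved form
CREATE TABLE form_index (data_id INTEGER, definition_id INTEGER,
                         field_name TEXT, field_value TEXT);
CREATE INDEX form_index_lookup ON form_index (definition_id, field_name, field_value);
""")


def save(definition_id, data, searchable_fields):
    """Store the form as JSON and index only the fields marked searchable."""
    cursor = conn.execute(
        "INSERT INTO form_data (definition_id, payload) VALUES (?, ?)",
        (definition_id, json.dumps(data)),
    )
    data_id = cursor.lastrowid
    conn.executemany(
        "INSERT INTO form_index VALUES (?, ?, ?, ?)",
        [(data_id, definition_id, name, str(data[name]))
         for name in searchable_fields if name in data],
    )
    return data_id


def find(definition_id, field_name, field_value):
    """Look up by name, value and definition id, then load the stored JSON."""
    rows = conn.execute(
        "SELECT d.payload FROM form_index i "
        "JOIN form_data d ON d.data_id = i.data_id "
        "WHERE i.definition_id = ? AND i.field_name = ? AND i.field_value = ? "
        "ORDER BY d.data_id",
        (definition_id, field_name, field_value),
    )
    return [json.loads(row[0]) for row in rows]
```

A call such as save(1, {"name": "Ann", "gender": "female"}, ["gender"]) followed by find(1, "gender", "female") returns the stored form, while fields that were not marked searchable cannot be found through the index.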
If you're not stuck with MySQL, I suggest using PostgreSQL, which provides built-in array datatypes.
You can define an array-of-varchar field in your groups table to store group-specific fields. To store the values, you can do the same in the membership table.
Compared to XML types based on string parsing, this array approach will be really fast.
If you don't like the array approach, you can check out the XML datatypes and the optional hstore datatype, which is a key-value store.

Database schema for ACL

I want to create a schema for an ACL; however, I'm torn between a couple of ways of implementing it.
I am pretty sure I don't want to deal with cascading permissions as that leads to a lot of confusion on the backend and for site administrators.
I think I can also live with users only being in one role at a time. A setup like this will allow roles and permissions to be added as needed as the site grows without affecting existing roles/rules.
At first I was going to normalize the data and have three tables to represent the relations.
ROLES { id, name }
RESOURCES { id, name }
PERMISSIONS { id, role_id, resource_id }
A query to figure out whether a user was allowed somewhere would look like this:
SELECT id FROM resources WHERE name = ?
SELECT * FROM permissions WHERE role_id = ? AND resource_id = ? ($user_role_id, $resource->id)
Then I realized that I will only have about 20 resources, each with up to 5 actions (create, update, view, etc..) and perhaps another 8 roles. This means that I can exercise blatant disregard for data normalization as I will never have more than a couple of hundred possible records.
So perhaps a schema like this would make more sense.
ROLES { id, name }
PERMISSIONS { id, role_id, resource_name }
which would allow me to lookup records in a single query
SELECT * FROM permissions WHERE role_id = ? AND resource_name = ? ($user_role_id, 'post.update')
So which of these is more correct? Are there other schema layouts for ACL?
In my experience, the real question mostly breaks down to whether or not any amount of user-specific access-restriction is going to occur.
Suppose, for instance, that you're designing the schema of a community and that you allow users to toggle the visibility of their profile.
One option is to stick to a public/private profile flag and stick to broad, pre-emptive permission checks: 'users.view' (views public users) vs, say, 'users.view_all' (views all users, for moderators).
Another involves more refined permissions: you might want them to be able to configure things so they can make themselves (a) viewable by all, (b) viewable by their hand-picked buddies, (c) kept private entirely, and perhaps (d) viewable by all except their hand-picked bozos. In this case you need to store owner/access-related data for individual rows, and you'll need to heavily abstract some of these things in order to avoid materializing the transitive closure of a dense, oriented graph.
With either approach, I've found that added complexity in role editing/assignment is offset by the resulting ease/flexibility in assigning permissions to individual pieces of data, and that the following worked best:
Users can have multiple roles
Roles and permissions merged in the same table with a flag to distinguish the two (useful when editing roles/perms)
Roles can assign other roles, and roles and perms can assign permissions (but permissions cannot assign roles), from within the same table.
The resulting oriented graph can then be pulled in two queries, built once and for all in a reasonable amount of time using whichever language you're using, and cached into Memcache or similar for subsequent use.
From there, pulling a user's permissions is a matter of checking which roles he has, and processing them using the permission graph to get the final permissions. Check permissions by verifying that a user has the specified role/permission or not. And then run your query/issue an error based on that permission check.
You can extend the check for individual nodes (i.e. check_perms($user, 'users.edit', $node) for "can edit this node" vs check_perms($user, 'users.edit') for "may edit a node") if you need to, and you'll have something very flexible/easy to use for end users.
As the opening example should illustrate, be wary of steering too much towards row-level permissions. The performance bottleneck is less in checking an individual node's permissions than it is in pulling a list of valid nodes (i.e. only those that the user can view or edit). I'd advise against anything beyond flags and user_id fields within the rows themselves if you're not (very) well versed in query optimization.
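The "pull the graph once, then resolve a user's roles through it" step could look roughly like the following. This is only a sketch: the role and permission names, the GRANTS dictionary (standing in for the rows pulled from the merged roles/permissions table) and the simplified two-argument check_perms are assumptions, and caching is left out.

```python
from collections import deque

# Edges of the oriented graph: each node grants the nodes listed for it.
# Roles may grant roles and permissions; permissions may grant permissions only.
GRANTS = {
    "admin": {"moderator", "users.edit"},
    "moderator": {"member", "users.view_all"},
    "member": {"users.view"},
    "users.view_all": {"users.view"},
}
ROLES = {"admin", "moderator", "member"}  # the flag that tells roles from permissions


def effective_permissions(assigned_roles):
    """Walk the graph breadth-first and collect every reachable permission."""
    seen = set()
    queue = deque(assigned_roles)
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # also guards against cycles in the graph
        seen.add(node)
        queue.extend(GRANTS.get(node, ()))
    return seen - ROLES


def check_perms(user_roles, permission):
    """Pre-emptive check: may this user do this kind of thing at all?"""
    return permission in effective_permissions(user_roles)


print(sorted(effective_permissions({"moderator"})))  # ['users.view', 'users.view_all']
print(check_perms({"member"}, "users.edit"))         # False
```

In practice the result of effective_permissions would be computed once per user and cached, which keeps the per-request check to a set lookup.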
"This means that I can exercise blatant disregard for data normalization as I will never have more than a couple hundred possible records."
The number of rows you expect isn't a criterion for choosing which normal form to aim for.
Normalization is concerned with data integrity. It generally increases data integrity by reducing redundancy.
The real question to ask isn't "How many rows will I have?", but "How important is it for the database to always give me the right answers?" For a database that will be used to implement an ACL, I'd say "Pretty danged important."
If anything, a low number of rows suggests you don't need to be concerned with performance, so 5NF should be an easy choice to make. You'll want to hit 5NF before you add any id numbers.
"A query to figure out if a user was allowed somewhere would look like this:"
SELECT id FROM resources WHERE name = ?
SELECT * FROM permissions
WHERE role_id = ? AND resource_id = ? ($user_role_id, $resource->id)
That you wrote that as two queries instead of using an inner join suggests that you might be in over your head. (That's an observation, not a criticism.)
SELECT p.*
FROM permissions p
INNER JOIN resources r ON r.id = p.resource_id
WHERE p.role_id = ?
AND r.name = ?
You can use a SET to assign the roles.
CREATE TABLE permission (
id INTEGER PRIMARY KEY AUTO_INCREMENT
,name VARCHAR(255) -- MySQL needs a length; 255 is an arbitrary choice
,perm SET('create', 'edit', 'delete', 'view')
,resource_id INTEGER );

MySQL database model for signups with and without addresses

I've been thinking about this all evening (GMT) but I can't seem to figure out a good solution for this one. Here's the case...
I have to create a signup system which distinguishes 4 kinds of "users":
Individual sign ups (require address info)
Group sign ups (don't require address info)
Group contact (require address info)
Application users (don't require address info)
I really cannot come up with a decent way of modeling this into something that makes sense. I'd greatly appreciate it if you could share your ideas.
Thanks in advance!
Sounds like a good case for single-table inheritance.
Requiring certain data is more a function of your application logic than your database. You can definitely define database columns that don't allow NULL values, but they can be set to "" (empty string) without any errors.
As far as how to structure your database, have two separate tables:
User
UserAddress
When you have a new signup that requires contact info, your application will create records in both tables. When a new signup doesn't require address info, your application will only create a record in the User table.
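A minimal sketch of that split, with the "address required" rule enforced in application code as suggested. Python's sqlite3 module stands in for MySQL here; the Kind, Name, Street and City columns, the kind labels and the sign_up helper are all assumptions made for the example.

```python
import sqlite3

# Assumption: these two signup kinds are the ones that must supply an address
NEEDS_ADDRESS = {"individual", "group_contact"}

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE User (UserId INTEGER PRIMARY KEY, Kind TEXT NOT NULL, Name TEXT NOT NULL);
CREATE TABLE UserAddress (
    UserId INTEGER PRIMARY KEY REFERENCES User(UserId),
    Street TEXT NOT NULL,
    City TEXT NOT NULL
);
""")


def sign_up(kind, name, address=None):
    """Create the User row, plus a UserAddress row when an address is given."""
    if kind in NEEDS_ADDRESS and address is None:
        raise ValueError(f"{kind} signups require an address")
    with conn:  # both inserts commit together or not at all
        cursor = conn.execute("INSERT INTO User (Kind, Name) VALUES (?, ?)", (kind, name))
        user_id = cursor.lastrowid
        if address is not None:
            conn.execute(
                "INSERT INTO UserAddress VALUES (?, ?, ?)",
                (user_id, address["street"], address["city"]),
            )
    return user_id
```

Calling sign_up("application", "reporting tool") creates only the User row, while sign_up("individual", "Ann") without an address raises before anything is written.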
There are a couple considerations here: first, I like to look at User/Group as a case of a Composite pattern. It clearly meets the requirement: you often have to treat the aggregate and individual versions of the entity interchangeably (as you note). Implementing a composite in a database is not that hard. If you are using an ORM, it is pretty simple (inheritance).
On the other part of the question, you always have the ability to create data structures that are mostly empty. Generally, that's a bad idea. So you can say 'well, in the beginning, we don't have any information about the User so we will just leave all the other fields blank.' A better approach is to try and model the phases as if they were part of an FSM. One of the clearest ways to do this in this particular case is to distinguish between Users, Accounts and some other more domain-specific entity, e.g. Subscriber or Customer. Then, I can come and browse using User, sign up and make an Account, then later when you want address and other personal information, become a subscriber. This would also imply inheritance, and you have the added benefit of being able to have a true representation of the population at any time that doesn't require stupid shenanigans like 'SELECT COUNT(*) WHERE _ not null,' etc.
Here's a suggestion from my end after weighing the pros and cons of this model. I think the ideal setup is to have all users be a user entity that belongs to a group, without differentiating groups from individuals (except, of course, for flagging a group contact person and creating a link with a groups table). So we came up with the alternative of copying the group contact's user details to the group members when the group is created.
This way all entities that actually are a person will get their own table.
Could this be a good idea? Awaiting your comments :)
I've decided to go with a construction where group members are separated from the user pool anyway. The group members eventually have no relation with a user since they don't require access to mutating their personal data, that's what a group contact person is for. Eventually I could add a possibility for groups to have multiple contact persons, even distinguishing persons that are or are not allowed to edit any member data.
That's my answer on this one.

Permissions for web site users

I'm working on a web site where each user can have multiple roles/permissions such as basic logging in, ordering products, administrating other users, and so on. On top of this, there are stores, and each store can have multiple users administrating it. Each store also has its own set of permissions.
I've confused myself and am not sure how best to represent this in a db. Right now I'm thinking:
users
roles
users_roles
stores
stores_users
But, should I also have stores_roles and stores_users_roles tables to keep track of separate permissions for the stores or should I keep the roles limited to a single 'roles' table?
I originally thought of having only a single roles table, but then what about users who have roles in multiple stores? I.e., if a user is given a role of let's say 'store product updating' there would need to be some method of determining which store this is referring to. A stores_users_roles table could fix this by having a store_id field, thus a user could have 'store product updating' and 'store product deletion' for store #42 and only 'store product updating' for store #84.
I hope I'm making sense here.
Edit
Thanks for the info everyone. Apparently I have some thinking to do. This is simply a fun project I'm working on, but RBAC has always been something that I wanted to understand better.
This is probably obvious to you by now, but role based access control is hard. My suggestion is, don't try to write your own unless you want that one part to take up all the time you were hoping to spend on the 'cool stuff'.
There are plenty of flexible, thoroughly-tested authorization libraries out there implementing RBAC (sometimes mislabeled as ACL), and my suggestion would be to find one that suits your needs and use it. Don't reinvent the wheel unless you are a wheel geek.
It seems likely to me that if I have permission to do certain roles in a set of stores, then I would probably have the same permissions in each store. So having a single roles table would probably be sufficient. So "joe" can do "store product updating" and "store product deletion", then have a user_stores table to list which stores he has access to. The assumption is for that entire list, he would have the same permissions in all stores.
If the business rules are such that he could update and delete in one store, but only update, no delete, in another store, well then you'll have to get more complex.
In my experience you'll usually be told that you need a lot of flexibility, then once implemented, no one uses it. And the GUI gets very complex and makes it hard to administer.
If the GUI does get complex, I suggest you look at it from the point of view of the store as well as the point of view of the user. In other words, instead of selecting a user, then selecting what permissions they have, and what stores they can access, it may be simpler to first select a store, then select which users have access to which roles in that store. Depends I guess on how many users and how many stores. In a past project I found it far easier to do it one way than the other.
Your model looks ok to me. The only modification I think you need is as to the granularity of the Role. Right now, your role is just an operation.
But first, you need a store_role table, a join table resolving the many-to-many relationship between a role and a store, i.e. one store can have many roles and one role can be done in many stores.
Eg: StoreA can CREATE, UPDATE, DELETE customer. and DELETE customer can be done in StoreA, StoreB and StoreC.
Next, you can freely associate users to store_role_id in the user_store_roles table.
Now, a user_store_role record will have a user_id and a store_role_id:
A collection of
SELECT * FROM USER_STORE_ROLE WHERE user_id = #userID
returns all permitted operations of the user in all the stores.
For a collection of a user's roles in a particular store, do an inner join of the above to the store_role table, adding a WHERE clause like
where STORE_ROLE.store_id = #storeID
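Put together, the store_role / user_store_role layout can be tried out with the example from the question (update and delete in store #42, update only in store #84). This sketch uses Python's sqlite3 module in place of the real database; the column names beyond user_id, store_id and store_role_id, the sample user id 7 and the roles_in_store helper are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Which roles exist in which store
CREATE TABLE store_role (store_role_id INTEGER PRIMARY KEY, store_id INTEGER, role TEXT);
-- Which user holds which store-specific role
CREATE TABLE user_store_role (user_id INTEGER, store_role_id INTEGER);

INSERT INTO store_role VALUES
    (1, 42, 'store product updating'),
    (2, 42, 'store product deletion'),
    (3, 84, 'store product updating'),
    (4, 84, 'store product deletion');
-- Hypothetical user 7: update + delete in store 42, update only in store 84
INSERT INTO user_store_role VALUES (7, 1), (7, 2), (7, 3);
""")


def roles_in_store(user_id, store_id):
    """All roles a user holds in one particular store."""
    rows = conn.execute(
        "SELECT sr.role "
        "FROM user_store_role usr "
        "INNER JOIN store_role sr ON sr.store_role_id = usr.store_role_id "
        "WHERE usr.user_id = ? AND sr.store_id = ? "
        "ORDER BY sr.role",
        (user_id, store_id),
    )
    return [row[0] for row in rows]


print(roles_in_store(7, 42))  # ['store product deletion', 'store product updating']
print(roles_in_store(7, 84))  # ['store product updating']
```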
Put a store_id in the user_roles table.
If this is Rails, the User model would declare has_many :stores, :through => :roles