I'm building a small permission system, but unfortunately I'm no SQL expert by any means.
In this system I've decided to give all users a role and then assign specific permissions to the roles. My current database tables look like this:
My question is: what's the best way to check whether a given User.id has a permission, given a Permission.permission_name value? I've come up with the following query:
SELECT EXISTS (
    SELECT p.id
    FROM `User` u
    INNER JOIN `Role_Permission` rp
        ON u.role_id = rp.role_id
    INNER JOIN `Permission` p
        ON rp.permission_id = p.id
    WHERE u.id = 1
      AND p.permission_name = 'doStuff'
) AS userHasPermission
It works, but from my understanding joins are expensive, and this query joins the contents of three tables before filtering out what it needs.
Link to sqlfiddle: http://sqlfiddle.com/#!2/6ed7b/1
Thank you.
I don't think there's much room to optimise the query. In any real-world scenario, no matter how big the user table gets, the role and permission tables shouldn't exceed three digits, so role_permission would not exceed 998,001 records. If all the right columns are indexed properly, I believe the SQL will be quite fast (< 0.1 sec). You can always run EXPLAIN to check for bottlenecks.
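To make the indexing point concrete, here is a minimal, self-contained sketch of the schema and the EXISTS query. It uses Python's sqlite3 module as a stand-in for MySQL (the question targets MySQL; the sample ids, the composite index name, and the seed data are all illustrative assumptions, not from the original fiddle):

```python
import sqlite3

# Hypothetical miniature of the schema from the question (SQLite stand-in).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE User (id INTEGER PRIMARY KEY, role_id INTEGER);
    CREATE TABLE Permission (id INTEGER PRIMARY KEY, permission_name TEXT);
    CREATE TABLE Role_Permission (role_id INTEGER, permission_id INTEGER);
    -- Index the join columns so the lookup never scans the whole table.
    CREATE INDEX idx_role_perm ON Role_Permission (role_id, permission_id);

    INSERT INTO User VALUES (1, 10);
    INSERT INTO Permission VALUES (5, 'doStuff');
    INSERT INTO Role_Permission VALUES (10, 5);
""")

# The EXISTS query from the question, parameterized.
row = conn.execute("""
    SELECT EXISTS (
        SELECT p.id
        FROM User u
        JOIN Role_Permission rp ON u.role_id = rp.role_id
        JOIN Permission p ON rp.permission_id = p.id
        WHERE u.id = ? AND p.permission_name = ?
    )
""", (1, "doStuff")).fetchone()
print(row[0])  # 1 -> the user has the permission
```

SQLite's EXPLAIN QUERY PLAN (the cousin of MySQL's EXPLAIN) can then confirm the indexes are actually used.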
(Off topic)
Alternatively, having worked on a similar project recently, there are a few choices out there to improve fetch speed from a 'finite' number of records.
Memory: You can choose to keep all the relevant tables/data in memory (as opposed to on disk) to minimise I/O-related latency.
NoSQL: You can either choose a NoSQL solution like MongoDB and/or implement a NoSQL-like structure in MySQL to eliminate joins.
Redis: Arguably the best solution if you'd like to think outside the box. Fastest of all.
I don't think there is much room for optimization, not without compromising the normalization of the database. Just make sure that you have the appropriate indexes in place.
Some alternatives would be:
Store the permission name directly in the role-permission table, thus requiring one less join. It will not be normalized, but this may be acceptable if permissions rarely change and you really need maximum performance.
Do not use integer ids for the permissions; instead, use their name as the unique identifier. Then you don't need the Permission table at all, unless you need to add some attribute to permissions (but even that would still allow you to check for a permission with only one join).
You should also consider how often you need to run this query. Depending on your requirements, it may be acceptable to read all of a user's permissions only when the user enters the system and keep them in variables for the whole session; in that case you do not need such high performance from the query. Or you could initially load not the permissions but the role, which would mean one less join in the query.
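The load-once-per-session idea above can be sketched as follows. This is an illustrative sketch using sqlite3 in place of MySQL; the function name `load_permissions` and the seed data are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE User (id INTEGER PRIMARY KEY, role_id INTEGER);
    CREATE TABLE Permission (id INTEGER PRIMARY KEY, permission_name TEXT);
    CREATE TABLE Role_Permission (role_id INTEGER, permission_id INTEGER);
    INSERT INTO User VALUES (1, 10);
    INSERT INTO Permission VALUES (5, 'doStuff'), (6, 'doMore');
    INSERT INTO Role_Permission VALUES (10, 5), (10, 6);
""")

def load_permissions(conn, user_id):
    """Run the three-table join ONCE at login and cache the result as a set."""
    rows = conn.execute("""
        SELECT p.permission_name
        FROM User u
        JOIN Role_Permission rp ON u.role_id = rp.role_id
        JOIN Permission p ON rp.permission_id = p.id
        WHERE u.id = ?
    """, (user_id,))
    return {name for (name,) in rows}

session_permissions = load_permissions(conn, 1)

# Later permission checks are plain set lookups, no query at all.
print("doStuff" in session_permissions)  # True
```

The trade-off is staleness: permissions changed mid-session are not seen until the next login.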
Related
I have a database with the following structure:
username,email,ip,hash,salt
Currently we have around 600,000 users in this database.
Users are complaining that querying this database is rather slow.
In our tests, we found that it takes around 1.15 seconds to retrieve a user record.
This test is based on the following query:
SELECT * FROM users WHERE email = 'test@mail.com'
I'm no expert in database management. I know how to get by when using it like a dictionary, however I have no idea on database optimization.
I was hoping I could get some help. Ideally, we'd be able to query the DB like this in under a second on even 10 million users.
Does anyone have any suggestions for optimizing simple queries like this? I'm open to anything right now, even restructuring the database if there's a more logical way to do it, because right now the rows are simply stored in the order users registered.
MySQL has two important facilities for improving performance. For your type of query, 500,000 rows or 10,000,000 rows is just not a big deal. Although other technologies such as NOSQL can perform the same actions, applications such as yours typically rely on the ACID properties of databases. A relational database is probably the right solution.
The first facility -- as mentioned elsewhere -- is indexes. In your case:
create index idx_users_email on users(email);
An index will incur a very small amount of overhead for insert and delete operations. However, with the index, looking up a row should go down to well under 0.1 seconds -- even with concurrent queries.
Depending on the other queries you are running, other indexes may be appropriate.
The second important capability is partitioning the tables. This is not necessary for a users table. However, it can be quite useful for transactions and other types of data.
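The effect of that one-line index is easy to demonstrate. Below is a sketch using Python's sqlite3 in place of MySQL (the 10,000 synthetic users and the plan-inspection step are illustrative; SQLite's EXPLAIN QUERY PLAN plays the role of MySQL's EXPLAIN here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, email TEXT, ip TEXT, hash TEXT, salt TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, '', '', '')",
    [("user%d" % i, "user%d@mail.com" % i) for i in range(10000)],
)

# Without the index, the WHERE clause forces a full table scan.
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = 'user42@mail.com'"
).fetchall()

conn.execute("CREATE INDEX idx_users_email ON users(email)")

# With the index, the engine seeks directly to the matching row.
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = 'user42@mail.com'"
).fetchall()

print(plan_before[-1][-1])  # a SCAN of the whole table
print(plan_after[-1][-1])   # a SEARCH using idx_users_email
```

The same before/after comparison with MySQL's EXPLAIN shows `type: ALL` turning into `type: ref` on the new index.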
You could add an index, as already mentioned in the comments, but one thought presents itself: you are currently retrieving ALL the info for that row. It would be more efficient to target the query to retrieve only the information that is necessary, such as:
SELECT username FROM users WHERE email = 'test@mail.com';
Also, you should investigate PDO and bound parameters for security.
Let's say I wanted to make a database that could be used to keep track of bank accounts and transactions for a user. A database that can be used in a Checkbook application.
If i have a user table, with the following properties:
user_id
email
password
And then I create an account table, which can be linked to a certain user:
account_id
account_description
account_balance
user_id
And to go the next step, I create a transaction table:
transaction_id
transaction_description
is_withdrawal
account_id // The account to which this transaction belongs
user_id // The user to which this transaction belongs
Is having the user_id in the transaction table a good option? It would make the query cleaner if I wanted to get all the transactions for each user, such as:
SELECT * FROM transactions
JOIN users ON users.user_id = transactions.user_id
Or, I could just trace back to the users table from the account table
SELECT * FROM transactions
JOIN accounts ON accounts.account_id = transactions.account_id
JOIN users ON users.user_id = accounts.user_id
I know the first query is much cleaner, but is that the best way to go?
My concern is that by having this extra (redundant) column in the transaction table, I'm wasting space, when I can achieve the same result without said column.
Let's look at it from a different angle. From where will the query or series of queries start? If you have customer info, you can get account info and then transaction info or just transactions-per-customer. You need all three tables for meaningful information. If you have account info, you can get transaction info and a pointer to customer. But to get any customer info, you need to go to the customer table so you still need all three tables. If you have transaction info, you could get account info but that is meaningless without customer info or you could get customer info without account info but transactions-per-customer is useless noise without account data.
Either way you slice it, the information you need for any conceivable use is split up between three tables and you will have to access all three to get meaningful information instead of just a data dump.
Having the customer FK in the transaction table may provide you with a way to make a "clean" query, but the result of that query is of doubtful usefulness. So you've really gained nothing. I've worked writing Anti-Money Laundering (AML) scanners for an international credit card company, so I'm not being hypothetical. You're always going to need all three tables anyway.
Btw, the fact that there are FKs in the first place tells me the question concerns an OLTP environment. An OLAP environment (data warehouse) doesn't need FKs or any other data integrity checks, as warehouse data is static; it originates from an OLTP environment where the integrity checks have already been made. So there you can denormalize to your heart's content. Let's not give answers applicable to an OLAP environment to a question concerning an OLTP environment.
You should not use two foreign keys in the same table. This is not a good database design.
A user makes transactions through an account. That is how it is logically done; therefore, this is how the DB should be designed.
Using joins is how this should be done. You should not add the user_id key, as it is already reachable through the account table.
The wasted space is unnecessary and makes for a bad database design.
In my opinion, if you have a simple many-to-many relation, just use the two foreign keys as a composite primary key, and that's all.
Otherwise, if you have a many-to-many relation with extra columns, use one surrogate primary key and two foreign keys. It's easier to manage such a table as a single entity, just as Doctrine does. Generally speaking, simple many-to-many relations are rare; they are useful just for linking two tables.
Denormalizing is usually a bad idea. In the first place, it is often not faster from a performance standpoint. What it does is put data integrity at risk, and it can create massive problems if you end up changing from a 1-1 relationship to a 1-many.
For instance, what is to say that each account will have only one user? In your table design, that is all you would get, which is something I find suspicious right off the bat. Accounts in my system can have thousands of users. So that is the first place I question your model. Did you actually think in terms of whether the relationships would be 1-1 or 1-many, or did you just make an assumption? Data models are NOT easy to adjust after you have millions of records; in database design you need to do far more planning for the future, and far more thinking about the data's needs over time, than you do in application design.
But suppose you have a one-to-one relationship now, and three months after you go live you get a new account that needs to have 3 users. Now you have to remember all the places you denormalized in order to properly fix the data. This can create much confusion, as inevitably you will forget some of them.
Further, even if you never need to move to a more robust model, how are you going to maintain this if the user_id changes, as they often do? To keep the data integrity, you now need a trigger to maintain the data as it changes. Worse, if the data can be changed from either table, you could get conflicting changes. How do you handle those?
So you have created a maintenance mess, and possibly risked your data integrity, all to write "cleaner" code and save yourself all of ten seconds writing a join? You gain nothing in terms of the things that are important in database development, such as performance, security, or data integrity, and you risk a lot. How short-sighted is that?
You need to stop thinking in terms of "cleaner code" when developing for databases. Often the best code for a query is the most complex-looking, because it is the most performant, and that is critical for databases. Don't project object-oriented coding techniques onto database development; they are two very different things with very different needs. You need to start thinking in terms of how this will play out as the data changes, which you clearly are not doing, or you would not even consider such a thing. Think more about the meaning of the data and less about the "principles of software development", which are taught as if they apply to everything but in reality do not apply well to databases.
It depends. If you can get the data fast enough, use the normalized version (where user_id is NOT in the transaction table). If you are worried about performance, go ahead and include user_id. It will use more space in the database by storing redundant information, but you will be able to return the data faster.
EDIT
There are several factors to consider when deciding whether or not to denormalize a data structure. Each situation needs to be considered uniquely; no answer is sufficient without looking at the specific situation (hence the "It depends" that begins this answer). For the simple case above, denormalization would probably not be an optimal solution.
I am designing a system which has a database for storing users and information related to the users. More specifically each user in the table has very little information. Something like Name, Password, uid.
Then each user has zero or more containers, and the way I've initially done this is to create a second table in the database which holds containers and have a field referencing the user owning it. So something like containerName, content, owner.
So a query on data from a container would look something like:
SELECT content
FROM containers
WHERE (containerName='someContainer' AND owner='someOwner');
My question is whether this is a good way. Thinking about scalability, say we have thousands of users with around 5 containers each (each user could have a different number of containers, but 5 would probably be typical). My concern is that searching the database will become slow when one query only ever wants 5 entries out of 5 × 1000. We may typically only want one specific container's content, so we are looking through the database with an overhead of basically 4,995 entries, am I right? And what happens if I sign up a million users? It would become a huge table, which intuitively feels like a bad idea.
A second take on it would be to have one table per user; however, that doesn't feel like a very good solution either, since it would give me 1,000 tables in the database, which (also by intuition) seems like a bad way to do it.
Any help in understanding how to design this would be greatly appreciated, I hope it's all clear and easy to follow.
The accepted way of handling this is by creating an INDEX on the owner field. That way, MySQL optimizes queries with owner = 'some value' conditions.
See also: http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
You're right in saying that 1,000 tables is not scalable. Once you start reaching a few million records you might want to consider sharding (splitting records across several locations based on user attributes)... but by that time you'd already be quite successful, I think ;-)
If it is an RDBMS (like Oracle or MySQL), you can create indexes on columns that are frequently queried to optimize table traversal. Indexes are automatically created for PRIMARY keys and (optionally) for FOREIGN keys.
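Since the query filters on both owner and containerName, a composite index covering both columns fits it exactly. A minimal sketch (sqlite3 standing in for MySQL; the 1,000 users × 5 containers of seed data and the index name are invented to mirror the numbers in the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE containers (containerName TEXT, content TEXT, owner TEXT)")
conn.executemany(
    "INSERT INTO containers VALUES (?, ?, ?)",
    [("c%d" % j, "data-%d-%d" % (i, j), "user%d" % i)
     for i in range(1000) for j in range(5)],
)

# A composite index matching the WHERE clause lets the engine jump
# straight to the requested row instead of scanning all 5,000.
conn.execute("CREATE INDEX idx_owner_name ON containers (owner, containerName)")

row = conn.execute(
    "SELECT content FROM containers WHERE containerName = ? AND owner = ?",
    ("c3", "user42"),
).fetchone()
print(row[0])  # data-42-3
```

With the index in place, lookup cost grows roughly with the log of the table size, so a million users changes little.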
I have a table with all registered members, with columns like uid, username, last_action_time.
I also have a table that keeps track of who has been online in the past 5 minutes. It is populated by a cronjob by pulling data from members with last_action_time being less than 5 minutes ago.
Question: Should my online table include username or no? I'm asking this because I could JOIN both tables to obtain this data, but I could store the username in the online table and not have to join. My concern is that I will have duplicate data stored in two tables, and that seems wrong.
If you haven't run into performance issues, DO NOT denormalize. There is a good saying: "normalize until it hurts, denormalize until it works". In your case, it works with the normalized schema (users table joined), and databases are designed to handle huge amounts of data.
This approach is called denormalization: sometimes, for a quick select query, we have to duplicate some data across tables. In this case, I believe it is a good choice if you have a lot of data in both tables.
You just hit a very valid question: when does it make sense to duplicate data?
I could rewrite your question as: when does it make sense to use a cache? Caches need maintenance: you need to keep them up to date yourself, and they use up some extra space (although negligible in this case). But they have one pro: a performance increase.
In the example you mentioned, you need to see if that performance increase is actually worth it and if it outweighs the additional work of having and maintaining a cache.
My gut feeling is that your database isn't gigantic, so joining every time should take a minimal amount of effort from the server, so I'd go with that.
Hope it helps
You shouldn't store the username in the online table. There shouldn't be any performance issue; just use a join every time to get the username.
Plus, you don't need the online table at all. Why not query the members table directly for users whose last_action_time is less than 5 minutes ago?
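That suggestion can be sketched in a few lines. This uses sqlite3 for illustration; the two sample members and the fixed "current time" are invented, and MySQL would spell the cutoff as `NOW() - INTERVAL 5 MINUTE` rather than SQLite's `datetime()`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (uid INTEGER, username TEXT, last_action_time TEXT)")
conn.executemany("INSERT INTO members VALUES (?, ?, ?)", [
    (1, "alice", "2024-01-01 12:04:30"),  # active 30 seconds ago
    (2, "bob",   "2024-01-01 11:00:00"),  # idle for an hour
])

# "Who is online" straight from members -- no separate online table,
# no cron job, no duplicated usernames.
now = "2024-01-01 12:05:00"  # stand-in for the current time
online = conn.execute("""
    SELECT username FROM members
    WHERE last_action_time > datetime(?, '-5 minutes')
""", (now,)).fetchall()
print(online)  # [('alice',)]
```

An index on last_action_time keeps this fast as the members table grows.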
A user ID would be an integer (4 bytes); a username, I would imagine, is up to 16 bytes. How many users are there? How often does a username change? These are the questions to consider.
I would just store the username. I would have thought that once the username is registered, it is fixed for the duration.
It is difficult to answer these questions without a little background; performance issues are hard to reason about when the depth and breadth, usage, etc. are not known.
What is the most efficient method of managing blocked users for each user so they don't appear in search results on a PHP/MySQL-run site?
This is the way I am currently doing it and I have a feeling this is not the most efficient way:
Create a BLOB for each user on their main user table that gets updated with the unique User IDs of each user they block. So if User IDs 313, 563, and 732 are blocked by a user, their BLOB simply contains "313,563,732". Then, whenever a search result is queried for that user, I include the BLOB contents like so: "AND UserID NOT IN (313,563,732)", so that the blocked User IDs don't show up for that user. When a user "unblocks" someone, I remove that User ID from their BLOB.
Is there a better way of doing this (I'm sure there is!)? If so, why is it better and what are the pros and cons of your suggestion?
Thanks, I appreciate it!
You are saving relationships in a relational database in a way that it does not understand. You will not have the benefit of foreign keys etc.
My recommended way to do this would be to have a separate table for the blocked users:
create table user_blocked_users (user_id int, blocked_user_id int);
Then when you want to filter the search result, you can simply do it with a subquery:
select * from user u where ?searcherId not in (select b.blocked_user_id from user_blocked_users b where b.user_id = u.id)
You may want to start out that way, and then optimize it with queries, caches or other things if necessary - but do that last. First, build a consistent and correct data model that you can work with.
Some of the pros of this approach:
You will have a correct data model of your block relations
With foreign keys, you will keep your data model consistent
The cons of this approach:
In your case, none that I can see
The cons of your approach:
It will be slow and not scalable, as the BLOB contents cannot be indexed and must be parsed on every query
Your data model will be hard to maintain and you will not have the benefit of foreign keys
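Here is a minimal sketch of the separate blocked-users table and the filtered search, using sqlite3 for illustration (the sample user IDs are taken from the question's "313,563,732" example; the filtering direction shown here is "hide users the searcher has blocked", which is an assumption about the intended semantics):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE user_blocked_users (
        user_id INTEGER REFERENCES user(id),
        blocked_user_id INTEGER REFERENCES user(id),
        PRIMARY KEY (user_id, blocked_user_id)
    );
    INSERT INTO user VALUES (1, 'searcher'), (313, 'x'), (563, 'y'),
                            (732, 'z'), (900, 'visible');
    INSERT INTO user_blocked_users VALUES (1, 313), (1, 563), (1, 732);
""")

# Search results for user 1, with everyone they blocked filtered out.
rows = conn.execute("""
    SELECT u.id, u.name FROM user u
    WHERE u.id <> ?
      AND u.id NOT IN (SELECT b.blocked_user_id
                       FROM user_blocked_users b
                       WHERE b.user_id = ?)
""", (1, 1)).fetchall()
print(rows)  # only user 900 remains
```

The composite primary key doubles as the index the subquery needs, and unblocking is a single DELETE instead of string surgery on a BLOB.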
You are looking for a cross reference table.
You have a table containing user IDs and "blocked" user IDs. Then you SELECT blockid FROM blocked WHERE uid=$user and you have a list of user IDs that are blocked, which you can filter out with a WHERE clause such as WHERE uid NOT IN (SELECT blockid FROM blocked WHERE uid=$user).
Now you can block multiple users per user, and the other way round, with all the speed of an actual database.
You are looking for a second table joined in a many-to-many relationship. Check this post:
Many-to-Many Relationships in MySQL
The "Pros" are numerous. You are handling your data with referential integrity, which has incalculable benefits down the road. The issue you described will be followed by others in your application, and some of those others will be more unmanageable than this one.
The "Cons" are that
You will have to learn how referential data works (but that's ahead of you anyway, as I say)
You will have more tables to deal with (ditto)
You will have to learn more about CRUD, which is difficult ... but, just part of the package.
What you are currently using is not regarded as a good practice for relational database design, however, like with anything else, there are cases when that approach can be justified, albeit restrictive in terms of what you can accomplish.
What you could do is, like J V suggested, create a cross reference table that contains mappings of user relationships. This allows you to, among other things, skip unnecessary queries, make use of table indexes and possibly most importantly, it gives you far greater flexibility in the future.
For instance, you can add a field to the table that indicates the type/status of the relationship (i.e. blocked, friend, pending approval, etc.), which would allow a much more complex system to be developed easily.
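The status-field idea can be sketched like this. Again sqlite3 stands in for MySQL, and the table name `user_relationship`, the status values, and the seed rows are all hypothetical, just to show the shape:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_relationship (
        user_id INTEGER,
        other_user_id INTEGER,
        status TEXT,  -- e.g. 'blocked', 'friend', 'pending'
        PRIMARY KEY (user_id, other_user_id)
    )
""")
conn.executemany("INSERT INTO user_relationship VALUES (?, ?, ?)", [
    (1, 313, "blocked"),
    (1, 900, "friend"),
    (1, 555, "pending"),
])

# One table now answers several questions: who is blocked,
# who is a friend, whose request is still pending.
blocked = conn.execute(
    "SELECT other_user_id FROM user_relationship"
    " WHERE user_id = ? AND status = 'blocked'",
    (1,),
).fetchall()
print(blocked)  # [(313,)]
```

Adding a new relationship kind later is just a new status value, with no schema change.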