Shiro + Keys stored as strings = failure point or unnecessary concern? - mysql

Not getting into too many specifics, this is a high level question.
I've always gone by the idea that it is never a good idea to store primary keys in places that do not have a constraint. For example, storing a primary key in an EAV-style architecture ("USER_ID", 144): if that user is ever deleted, the deletion will not be reflected in the EAV mapping and will cause issues further down the road.
So I'm creating a new application using Shiro as a security/permission framework. I need users to be able to edit themselves but not other users, and I also need some users to be able to edit anyone. Simple enough:
user1 = "user:441:edit"
user2 = "user:edit"
In addition, I could have someone who can only edit a subset of users, something like this:
user3 = "user:459:edit","user:460:edit","user:461:edit"
Or someone who can edit the users in a department, but only that department:
user4 = "department:5898:user:edit"
If someone from user3's list is deleted, there's no way to update user3's permissions without magic (going through all the permissions and finding the ones to remove).
Now, I don't plan to reset the keys, but if that were ever to happen and I did NOT clean up the old keys, users could suddenly be able to edit newly created users who were assigned the reused keys.
I could mitigate some of this by using a generated code ("user:ciS84nFSHK:edit") that is unique across ALL tables to manage the deletion of permissions. However, adding a few hundred million records makes me think this could grow unwieldy quickly.
Am I using Shiro improperly? Am I simply overfocused on keys getting mangled? Have you solved these issues? Any help would be appreciated.

After doing some more research with Shiro, I've decided to go with pst's suggestion and simply reference the values via a GUID (or perhaps a 10-character alphanumeric code to save space).
So user1 = "user:441:edit" would become "user:96aae854-fc40-42ff-b948-c8944c2fca92:edit". This negates the concern about storing keys as strings: even if the numeric keys are ever reset and reused, a GUID embedded in a permission string will never be handed out again.
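A minimal MySQL sketch of the idea, with assumed table and column names; the permission string is built from the GUID rather than the auto-increment key:

CREATE TABLE users (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    guid CHAR(36)     NOT NULL,   -- filled with UUID() at insert time
    name VARCHAR(100) NOT NULL,
    UNIQUE KEY uq_users_guid (guid)
);

INSERT INTO users (guid, name) VALUES (UUID(), 'alice');

-- the Shiro permission string for this user would then be
-- CONCAT('user:', guid, ':edit'), e.g. "user:96aae854-...:edit"

Even if the auto-increment counter were ever reset, a freshly generated GUID would not collide with one already embedded in a permission string.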

What's a suitable table design for objects which can be "trashed"

I'm designing a simple media "server" as part of a larger application. I've chosen to adopt terminology similar to the AWS S3 service, i.e. Objects and Buckets (i.e. files and directories).
I have two tables:
cdn_bucket
id, directory
and
cdn_object
id, bucket_id, filename, is_deleted
Other tables in the database can reference objects using a foreign key on cdn_object.id. This has a nice side effect in that I can specify a constraint to set the field to NULL in the event that the object is deleted (or indeed prevent deletion if needed), e.g.:
blog_post
id, title, body, featured_image
CONSTRAINT: featured_image = cdn_object.id ON DELETE SET NULL
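For concreteness, a minimal sketch of that constraint as actual MySQL DDL (InnoDB; the column types here are assumptions):

CREATE TABLE blog_post (
    id             INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title          VARCHAR(255) NOT NULL,
    body           TEXT,
    featured_image INT UNSIGNED NULL,
    FOREIGN KEY (featured_image) REFERENCES cdn_object (id)
        ON DELETE SET NULL
) ENGINE=InnoDB;

Note that featured_image has to be nullable for ON DELETE SET NULL to be allowed.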
I was told once that I shouldn't delete things, ever (that's an argument for another post, please don't comment on it here); hence the is_deleted flag. To clarify the question, this is what I mean by "trashed", i.e. recoverable.
This works great; however, I can't leverage the cascading functionality of the constraints (i.e. I mark an object as deleted, but the referring table, e.g. blog_post.featured_image, still references the old ID).
I was wondering what the SO opinions might be on the following two approaches, or if there's another approach which might be better.
1. Join the cdn_object table
SELECT bp.*, cdno.id AS featured_image
FROM blog_post bp
LEFT JOIN cdn_object cdno
    ON cdno.id = bp.featured_image AND cdno.is_deleted = 0;
(Using a LEFT JOIN rather than an inner join means a post whose object is trashed still comes back, just with a NULL featured_image.)
Pro: easy to implement.
Con: every query has to join the cdn_object table.
or
2. Use a trash table
Have another table, cdn_object_trash, and have the code 'move' the row from cdn_object into it when an object is deleted, triggering all the cascading constraints (see the sketch after the pro/con below).
Pro: allows the relational rules to do what they were designed to do
Con: bad by design? Not sure.
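A minimal sketch of option 2's 'move', assuming cdn_object_trash mirrors cdn_object's columns; wrapping the copy and the delete in one transaction keeps the move atomic:

START TRANSACTION;

INSERT INTO cdn_object_trash (id, bucket_id, filename)
    SELECT id, bucket_id, filename
    FROM cdn_object
    WHERE id = 42;           -- the object being trashed

DELETE FROM cdn_object
WHERE id = 42;               -- fires the ON DELETE rules on referring tables

COMMIT;

Restoring is the same move in reverse, though any foreign keys the cascade set to NULL would have to be re-pointed by hand.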
My gut feeling tells me I should use the is_deleted flag and write code accordingly, but this is a generic class and so I'd prefer to not force the developer to write the join every time if I can configure that logic in the DB.
I hope my situation/question is clear, please ask me to clarify any points if needed.
Your third option is to set up a reasonable backup and retention schedule, and use cascading deletes. While I understand the desire to "never delete anything", abiding by that principle is forcing you to be redundant in your programming choices (option 1) or to figure out how to build a trash table to redundantly store information (option 2; do you build a single table with a string representation of the data, or do you make a trash copy of the schema?). Both of those choices seem like a lot of work to maintain (over the long haul).
I've worked with variants of both choices, and if those were the only options on the table, option 1 would be a bit easier to maintain; however, you have to be EXTREMELY diligent in using it, and you have to make sure that future development efforts live up to that same standard.

Using Redis as a Key/Value store for activity stream

I am in the process of creating a simple activity stream for my app.
The current technology layer and logic is as follows:
All data relating to an activity is stored in MySQL, and an array of activity IDs is kept in Redis for every user.
1. A user performs an action; the activity is stored directly in an 'activities' table in MySQL and a unique 'activity_id' is returned.
2. An array of this user's 'followers' is retrieved from the database, and for each follower I push this new activity_id into their list in Redis.
3. When a user views their stream, I retrieve the array of activity IDs from Redis based on their user ID. I then perform a simple MySQL WHERE IN ($ids) query to get the actual activity data for all of these activity IDs.
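One detail worth noting on step 3: WHERE IN makes no ordering promise, so if the stream should come back in the same order as the Redis list, MySQL's FIELD() function can impose it. A sketch, with assumed table and column names:

SELECT *
FROM activities
WHERE id IN (57, 42, 31)
ORDER BY FIELD(id, 57, 42, 31);  -- rows come back in list order 57, 42, 31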
This kind of setup should, I believe, be quite scalable, as the queries will always be very simple IN queries. However, it presents several problems.
Removing a follower - If a user stops following someone, we need to remove all activity_ids corresponding to that user from their Redis list. This requires looping through all IDs in the Redis list and removing the ones that correspond to the removed user. This strikes me as quite inelegant; is there a better way of managing this?
'Archiving' - I would like to cap the Redis lists at a maximum of, say, 1000 activity_ids, and also frequently prune old data from the MySQL activities table to prevent it from growing to an unmanageable size. Obviously this can be achieved by removing old IDs from the user's stream list when we add a new one. However, I am unsure how to go about archiving this data so that users can view very old activity data should they choose to. What would be the best way to do this? Or am I simply better off enforcing this limit completely and preventing users from viewing very old activity data?
To summarise: what I would really like to know is whether my current setup/logic is a good or bad idea. Do I need a total rethink? If so, what are your recommended models? If you feel all is okay, how should I go about addressing the two issues above? I realise this question is quite broad and all answers will be opinion-based, but that is exactly what I am looking for: well-formed opinions.
Many thanks in advance.
1. Removing a follower doesn't seem so difficult to perform (no looping). Expressed as pseudo-SQL, treating the follower's Redis list as if it were a table:

delete redis_list
from redis_list
join activities on redis_list.activity_id = activities.id
where activities.user_id = 2    -- the unfollowed user
  and redis_list.user_id = 1    -- the follower
;
2. I'm not really sure about archiving. You could create archive tables every period and move old activities from the main table to an archive table periodically. A single, properly normalized activity table ought to be able to get pretty big, though. (Make sure any "large" activity stores its bulky data in a separate table; the main activity table should stay "narrow", since it's expected to have a lot of entries.)
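A hedged sketch of that periodic move, with table and column names that are assumptions:

-- hypothetical monthly archive job
CREATE TABLE activities_archive_2013_01 LIKE activities;

INSERT INTO activities_archive_2013_01
    SELECT * FROM activities WHERE created_at < '2013-02-01';

DELETE FROM activities WHERE created_at < '2013-02-01';

When a user pages back past the live table, the application would query the relevant archive table(s) instead.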

Database schema for ACL

I want to create a schema for an ACL; however, I'm torn between a couple of ways of implementing it.
I am pretty sure I don't want to deal with cascading permissions as that leads to a lot of confusion on the backend and for site administrators.
I think I can also live with users only being in one role at a time. A setup like this will allow roles and permissions to be added as needed as the site grows without affecting existing roles/rules.
At first I was going to normalize the data and have three tables to represent the relations.
ROLES { id, name }
RESOURCES { id, name }
PERMISSIONS { id, role_id, resource_id }
A query to figure out whether a user was allowed somewhere would look like this:
SELECT id FROM resources WHERE name = ?
SELECT * FROM permissions WHERE role_id = ? AND resource_id = ? ($user_role_id, $resource->id)
Then I realized that I will only have about 20 resources, each with up to 5 actions (create, update, view, etc..) and perhaps another 8 roles. This means that I can exercise blatant disregard for data normalization as I will never have more than a couple of hundred possible records.
So perhaps a schema like this would make more sense.
ROLES { id, name }
PERMISSIONS { id, role_id, resource_name }
which would allow me to look up records in a single query
SELECT * FROM permissions WHERE role_id = ? AND resource_name = ? ($user_role_id, 'post.update')
So which of these is more correct? Are there other schema layouts for ACL?
In my experience, the real question mostly breaks down to whether or not any amount of user-specific access-restriction is going to occur.
Suppose, for instance, that you're designing the schema for a community site and that you allow users to toggle the visibility of their profile.
One option is to stick to a public/private profile flag and broad, pre-emptive permission checks: 'users.view' (views public users) vs., say, 'users.view_all' (views all users, for moderators).
Another involves more refined permissions: you might want users to be able to configure things so they can make themselves (a) viewable by all, (b) viewable by their hand-picked buddies, (c) kept entirely private, and perhaps (d) viewable by all except their hand-picked bozos. In this case you need to store owner/access-related data for individual rows, and you'll need to heavily abstract some of these things in order to avoid materializing the transitive closure of a dense, oriented graph.
With either approach, I've found that the added complexity in role editing/assignment is offset by the resulting ease/flexibility in assigning permissions to individual pieces of data, and that the following worked best (a sketch follows the list):
Users can have multiple roles
Roles and permissions merged in the same table with a flag to distinguish the two (useful when editing roles/perms)
Roles can assign other roles, and roles and perms can assign permissions (but permissions cannot assign roles), from within the same table.
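One hypothetical MySQL shape for that arrangement (all table and column names here are assumptions, not a prescribed layout):

CREATE TABLE role_perm (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name    VARCHAR(100) NOT NULL,
    is_role TINYINT(1)   NOT NULL    -- 1 = role, 0 = permission
);

-- edges of the oriented graph: one role/perm assigning another
CREATE TABLE role_perm_edge (
    parent_id INT UNSIGNED NOT NULL,
    child_id  INT UNSIGNED NOT NULL,
    PRIMARY KEY (parent_id, child_id),
    FOREIGN KEY (parent_id) REFERENCES role_perm (id),
    FOREIGN KEY (child_id)  REFERENCES role_perm (id)
);

CREATE TABLE user_role (
    user_id INT UNSIGNED NOT NULL,
    role_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (user_id, role_id)
);

The rule that permissions cannot assign roles isn't expressible as a plain foreign key, so it would be enforced in application code or a trigger.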
The resulting oriented graph can then be pulled in two queries, built once and for all in a reasonable amount of time using whichever language you're using, and cached into Memcache or similar for subsequent use.
From there, pulling a user's permissions is a matter of checking which roles he has and processing them using the permission graph to get the final permissions. Checking a permission means verifying that the user has the specified role/permission; you then run your query or issue an error based on that permission check.
You can extend the check for individual nodes (i.e. check_perms($user, 'users.edit', $node) for "can edit this node" vs check_perms($user, 'users.edit') for "may edit a node") if you need to, and you'll have something very flexible/easy to use for end users.
As the opening example should illustrate, be wary of steering too much towards row-level permissions. The performance bottleneck is less in checking an individual node's permissions than it is in pulling a list of valid nodes (i.e. only those that the user can view or edit). I'd advise against anything beyond flags and user_id fields within the rows themselves if you're not (very) well versed in query optimization.
This means that I can exercise blatant disregard for data normalization as I will never have more than a couple of hundred possible records.
The number of rows you expect isn't a criterion for choosing which normal form to aim for.
Normalization is concerned with data integrity. It generally increases data integrity by reducing redundancy.
The real question to ask isn't "How many rows will I have?", but "How important is it for the database to always give me the right answers?" For a database that will be used to implement an ACL, I'd say "Pretty danged important."
If anything, a low number of rows suggests you don't need to be concerned with performance, so 5NF should be an easy choice to make. You'll want to hit 5NF before you add any id numbers.
A query to figure out if a user was allowed somewhere would look like this:
SELECT id FROM resources WHERE name = ?
SELECT * FROM permissions
WHERE role_id = ? AND resource_id = ? ($user_role_id, $resource->id)
That you wrote that as two queries instead of using an inner join suggests that you might be in over your head. (That's an observation, not a criticism.)
SELECT p.*
FROM permissions p
INNER JOIN resources r
    ON (r.id = p.resource_id
        AND r.name = ?)
WHERE p.role_id = ?
You can use a SET column to hold the permitted actions:
CREATE TABLE permission (
    id          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name        VARCHAR(100),
    perm        SET('create', 'edit', 'delete', 'view'),
    resource_id INT UNSIGNED
);

Perl MySQL - How do I skip updating or inserting a row if a particular field matches?

I am pretty new to this so sorry for my lack of knowledge.
I set up a few tables which I have successfully written to and accessed via a Perl script using the CGI and DBI modules, thanks to advice here.
This is a member list for a local band newsletter. Yeah, I know, there are tons of apps out there, but I desire to learn this.
1. I wanted to avoid updating or inserting a row if a piece of my input matches the data in one particular column/field.
When creating the table in phpMyAdmin, I clicked the "U" (unique) on that column's name in structure view. That seemed to work and no dupes are inserted, but I desire a hard-coded Perl solution so I understand the mechanics of this.
I read up on "insert ignore" / "update ignore" and searched all over, but nothing I found seems to simply skip a dupe.
The column is not a key or auto-increment, just a plain old field with an email address. (Mistake?)
2. When I write to the database, I want to do NOTHING if the incoming email address matches one already in that field.
I desire the fastest method so I can loop through the export data from their existing lists (they cannot figure out the software) with no racing/locking issues or whatever other conditions of which I am in obvious ignorance.
Since I am creating this from scratch, 1 and 2 may in fact be partially moot. If so, what would be the best approach? I would still like an auto-increment ID so I can access rows via the ID number or loop through with some kind of count++ foreach.
My stone-knife approach may be laughable to the gurus here, but I need to start somewhere.
Thanks in advance for your assistance.
With the email address column declared UNIQUE, INSERT IGNORE is exactly what you want for insertion. Sounds like you already know how to do the right thing!
(You could perform the "don't insert if it already exists" functionality in perl, but it's difficult to get right, because you have to wrap the test and update in a transaction. One of the big advantages of a relational database is that it will perform constraint checks like this for you, ensuring data integrity even if your application is buggy.)
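For concreteness, a small sketch of that combination, assuming a members table whose email column carries the UNIQUE constraint (names borrowed from the UPDATE example below):

ALTER TABLE members ADD UNIQUE KEY uq_members_email (email);

INSERT IGNORE INTO members (email, firstname)
VALUES ('sue@example.com', 'Sue');
-- silently skipped if sue@example.com already exists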
For updating, I'm not sure what an "update ignore" would look like. What is in the WHERE clause that is limiting your UPDATE to only affect the 1 desired row? Perhaps that auto_increment primary key you mentioned? If you are wanting to write, for example,
UPDATE members SET firstname='Sue' WHERE member_id = 5;
then I think this "update ignore" functionality you want might just be something like
UPDATE members SET firstname='Sue' WHERE member_id = 5
AND email != 'sue@example.com';
which is an odd thing to do, but that's my best guess for what you might mean :)
Just do the insert. If the data would make the unique column non-unique, you'll get an SQL error (a duplicate-key error), which you should be able to trap and handle however is appropriate (e.g. ignore it, log it, alert the user ...).

Versioned and indexed data store

I have a requirement to store all versions of an entity in an easily indexed way and was wondering if anyone has input on what system to use.
Without versioning, the system is simply a relational database with a row per, for example, person. If the person's state changes, that row is changed to reflect this. With versioning, the entry should be updated in such a way that we can always go back to a previous version. If I could use a temporal database, this would come for free, and I would be able to ask 'what was the state of all people, as of yesterday at 2pm, living in Dublin and aged 30'. Unfortunately there don't seem to be any mature open-source projects that can do temporal.
A really nasty way to do this is just to insert a new row per state change. This leads to duplication, as a person can have many fields with only one changing per update. It is also then quite slow to select the correct version for every person given a timestamp.
In theory it should be possible to use a relational database and a version control system to mimic a temporal database but this sounds pretty horrendous.
So I was wondering if anyone has come across something similar before and how they approached it?
Update
As suggested by Aaron, here's the query we currently use (in MySQL). It's definitely slow on our table with >200k rows. (id = table key; person_id = ID per person, duplicated if the person has many revisions.)
select name
from person p
where p.id = (select max(id)
              from person
              where person_id = p.person_id
                and timestamp <= :timestamp)
Update
It looks like the best way to do this is with a temporal DB, but given that there aren't any open-source ones out there, the next best method is to store a new row per update. The only problems are the duplication of unchanged columns and the slow query.
There are two ways to tackle this. Both assume that you always insert new rows. In every case, you must insert a timestamp (created) which tells you when a row was "modified".
The first approach uses a number to count how many instances you already have. The primary key is the object key plus the version number. The problem with this approach seems to be that you'll need a select max(version) to make a modification. In practice, this is rarely an issue since for all updates from the app, you must first load the current version of the person, modify it (and increment the version) and then insert the new row. So the real problem is that this design makes it hard to run updates in the database (for example, assign a property to many users).
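A minimal MySQL sketch of this first approach (the column names are assumptions):

CREATE TABLE person (
    person_id INT UNSIGNED NOT NULL,   -- stable key for the person
    version   INT UNSIGNED NOT NULL,   -- 1, 2, 3, ...
    created   DATETIME     NOT NULL,   -- when this version was written
    name      VARCHAR(100) NOT NULL,
    PRIMARY KEY (person_id, version)
);

-- the state of one person as of a point in time
SELECT *
FROM person
WHERE person_id = 42
  AND created <= :timestamp
ORDER BY version DESC
LIMIT 1;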
The next approach uses links in the database. Instead of a composite key, you give each object a new key and you have a replacedBy field which contains the key of the next version. This approach makes it simple to find the current version (... where replacedBy is NULL). Updates are a problem, though, since you must insert a new row and update an existing one.
To solve this, you can add a back pointer (previousVersion). This way, you can insert the new rows and then use the back pointer to update the previous version.
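And the second, linked approach, again with assumed names:

CREATE TABLE person (
    id               INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    person_id        INT UNSIGNED NOT NULL,   -- stable key for the person
    created          DATETIME     NOT NULL,
    replaced_by      INT UNSIGNED NULL,       -- key of the next version
    previous_version INT UNSIGNED NULL,       -- back pointer
    name             VARCHAR(100) NOT NULL
);

-- the current version of every person, no MAX() needed
SELECT * FROM person WHERE replaced_by IS NULL;

-- an update: insert the new row, then follow the back pointer
INSERT INTO person (person_id, created, previous_version, name)
VALUES (42, NOW(), 137, 'New Name');               -- 137 = old row's id
UPDATE person SET replaced_by = LAST_INSERT_ID()
WHERE id = 137;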
Here is a (somewhat dated) survey of the literature on temporal databases: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.6988&rep=rep1&type=pdf
I would recommend spending a good while sitting down with those references and/or Google Scholar to try to find some good techniques that fit your data model. Good luck!