What's a suitable table design for objects which can be "trashed" - mysql

I'm designing a simple media "server" as part of a larger application. I've chosen to adopt similar terminology to the AWS S3 service, i.e. Objects and Buckets (i.e. files and directories).
I have two tables:
cdn_bucket
id, directory
and
cdn_object
id, bucket_id, filename, is_deleted
Other tables in the database can reference objects using a foreign key on cdn_object.id. This has a nice side effect in that I can specify a constraint to set the field NULL in the event that the object is deleted (or indeed prevent deletion if needed), e.g.:
blog_post
id, title, body, featured_image
CONSTRAINT: featured_image = cdn_object.id ON DELETE SET NULL
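In actual MySQL DDL, that constraint would look something like this (a sketch, assuming InnoDB and unsigned integer keys):

CREATE TABLE blog_post (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255) NOT NULL,
    body TEXT,
    featured_image INT UNSIGNED NULL,  -- must be nullable for SET NULL to work
    CONSTRAINT fk_blog_post_featured_image
        FOREIGN KEY (featured_image) REFERENCES cdn_object (id)
        ON DELETE SET NULL
) ENGINE=InnoDB;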
I was told once that I shouldn't delete things, ever (that's an argument for another post, please don't comment on it here); hence the is_deleted flag. To clarify the question, this is what I mean by "trashed", i.e. recoverable.
This works great; however, I can't leverage the cascading functionality of the constraints (i.e. I mark an object as deleted, but the referring table, e.g. blog_post.featured_image, still references the old ID).
I was wondering what the SO opinions might be on the following two approaches, or if there's another approach which might be better.
1. Join the cdn_object table
SELECT bp.*, cdno.id AS featured_image FROM blog_post bp LEFT JOIN cdn_object cdno ON cdno.id = bp.featured_image AND cdno.is_deleted = 0
(the LEFT JOIN returns featured_image as NULL when the object is trashed, mimicking ON DELETE SET NULL)
Pro: easy to implement.
Con: every query has to join the cdn_object table.
or
2. Use a trash table
Have another table, cdn_object_trash, and have the code 'move' the row out of cdn_object when it's trashed, triggering all the cascading constraints (see the sketch below).
Pro: allows the relational rules to do what they were designed to do.
Con: bad by design? Not sure.
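The 'move' itself could be as simple as this (a sketch, assuming cdn_object_trash mirrors cdn_object's columns exactly):

INSERT INTO cdn_object_trash SELECT * FROM cdn_object WHERE id = ?;
DELETE FROM cdn_object WHERE id = ?;  -- this DELETE fires the cascading constraints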
My gut feeling tells me I should use the is_deleted flag and write code accordingly, but this is a generic class, so I'd prefer not to force the developer to write the join every time if I can configure that logic in the DB.
I hope my situation/question is clear, please ask me to clarify any points if needed.

Your third option is to set up a reasonable backup and retention schedule, and use cascading deletes. While I understand the desire to "never delete anything", abiding by that principle is forcing you to be redundant in your programming choices (option 1) or to figure out how to build a trash table to redundantly store information (option 2; do you build a single table with a string representation of the data, or do you make a trash copy of the schema?). Both of those choices seem like a lot of work to maintain (over the long haul).
I've worked with variants of both choices, and if those were the only options on the table, option 1 would be a bit easier to maintain; however, you have to be EXTREMELY diligent in using it, and you have to make sure that future development efforts live up to that same standard.
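If you do stay with the is_deleted flag, one way to centralise the filter from option 1 so developers can't forget it is a view, with the application reading from it instead of the base table (a sketch; the view name is my own):

CREATE VIEW cdn_object_live AS
    SELECT * FROM cdn_object WHERE is_deleted = 0;

The foreign keys still point at cdn_object, so trashed rows and their IDs are preserved, but routine queries never see them.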

Related

How to set up relational database tables for this many-to-many relationship?

I have a type of data called a chain. Each chain is made up of a specific sequence of another type of data called a step. So a chain is ultimately made up of multiple steps in a specific order. I'm trying to figure out the best way to set this up in MySQL that will allow me to do the following:
Look up all steps in a chain, and get them in the right order
Look up all chains that contain a step
I'm currently considering the following table setup as the appropriate solution:
TABLE chains
id date_created
TABLE steps
id description
TABLE chains_steps (this would be used for joins)
chain_id step_id step_position
In the table chains_steps, the step_position column would be used to order the steps in a chain correctly. It seems unusual for a JOIN table to contain its own distinct piece of data, such as step_position in this case. But maybe it's not unusual at all and I'm just inexperienced/paranoid.
I don't have much experience in all this, so I wanted to get some feedback. Are the three tables I suggested the correct way to do this? Are there any viable alternatives, and if so, what are the advantages/drawbacks?
You're doing it right.
Consider a database containing the Employees and Projects tables, and how you'd want to link them in a many-to-many fashion. You'd probably come up with an Assignments table (or Project_Employees in some naming conventions).
At some point you'd decide you want not only to store each project assignment, but you'd also want to store when the assignment started, and when it finished. The natural place to put that is in the assignment itself; it doesn't make sense to store it either with the project or with the employee.
In further designs you might even find it necessary to store more information about the assignment. For example, in an employee review process you may wish to store feedback related to the employee's performance in that project, so you'd make the assignment the "one" end of a relationship with a Review table, which would relate back to Assignments with an FK on assignment_id.
So in short, it's perfectly normal to have a junction table that has its own data.
That looks fine, and it's not unusual for the join table to contain a position/rank field.
Look up all steps in a chain, and get them in the right order
SELECT * FROM chains_steps
LEFT JOIN steps ON steps.id = chains_steps.step_id
WHERE chains_steps.chain_id = ?
ORDER BY chains_steps.step_position ASC
Look up all chains that contain a step
SELECT DISTINCT chains_steps.chain_id FROM chains_steps
LEFT JOIN chains ON chains.id = chains_steps.chain_id
WHERE chains_steps.step_id = ?
I think that the plan you've outlined is the correct approach. Don't worry too much about the presence of step_position on your mapping table. After all the step_position is a bit of data that is directly related to a step in the context of a chain. So the chains_steps table is the right place for it IMHO.
Some things to think about:
Foreign keys - use 'em!
Unique key on the chains_steps table - can a step be present in more than one position in a single chain? What about in different chains?
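To make those concrete, the DDL might look something like this (a sketch; the keys encode one possible answer to the uniqueness questions above, so adjust them to your actual rules):

CREATE TABLE chains (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    date_created DATETIME NOT NULL
) ENGINE=InnoDB;

CREATE TABLE steps (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    description TEXT NOT NULL
) ENGINE=InnoDB;

CREATE TABLE chains_steps (
    chain_id INT UNSIGNED NOT NULL,
    step_id INT UNSIGNED NOT NULL,
    step_position INT UNSIGNED NOT NULL,
    PRIMARY KEY (chain_id, step_position),        -- one step per position within a chain
    UNIQUE KEY uq_chain_step (chain_id, step_id), -- drop this if a step may repeat in a chain
    FOREIGN KEY (chain_id) REFERENCES chains (id),
    FOREIGN KEY (step_id) REFERENCES steps (id)
) ENGINE=InnoDB;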
Good luck!

Best practice for enabling "undelete" for database entities?

This is for a CRM application using PHP/MySQL. Various entities like customer, contact, note, etc, can be "deleted" by the user. Rather than actually deleting the entity from the database, I just want it to appear deleted to the application, but be kept in the DB and able to be "restored" if needed at a later time. Maybe even add some kind of "recycle bin" to the app.
I've thought of several ways to do this:
Move the deleted entity to another table. (customer to customer_deleted)
Change an attribute on the entity. (enabled to false)
I'm sure there are other ways and that each have their own implications on DB size, performance, etc, I'm just wondering what's the generally recommended way to do something like this?
I would go with a combination of both:
Set a deleted flag to true
Use a cron job to move flagged entries, after a while, to a table using the ARCHIVE storage engine
If you need to restore an entry, select it back into the main table and delete it from the archive
Why would I go this way?
If a customer deleted the wrong record, the restore can be done instantly
After a few weeks/months the main table may grow too large, so I would archive all entries that have been deleted for more than a week
A common practice is to set a deleted_at column to the date at which the entity was deleted by the user (defaulting to NULL). You may also include a deleted_by column for recording who deleted it. Using some kind of deleted column makes FK relationships easier to work with, since these won't break. By moving the row to a new table you would have to update FKs (and then update them again if you ever undelete). The downside is that you have to ensure all your queries exclude deleted rows (which wouldn't be a problem if you moved the row to a new table). Many ORMs make this filtering easier, so it depends on what you are using.
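A minimal sketch of that pattern in MySQL (customer stands in for any of the entities; column types are assumptions):

ALTER TABLE customer
    ADD COLUMN deleted_at DATETIME NULL DEFAULT NULL,
    ADD COLUMN deleted_by INT UNSIGNED NULL DEFAULT NULL;

-- "Delete"
UPDATE customer SET deleted_at = NOW(), deleted_by = ? WHERE id = ?;

-- Restore from the recycle bin
UPDATE customer SET deleted_at = NULL, deleted_by = NULL WHERE id = ?;

-- Every normal read must exclude trashed rows
SELECT * FROM customer WHERE deleted_at IS NULL;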

Shiro + Keys stored as strings = failure point or unnecessary concern?

Not getting into too many specifics, this is a high-level question.
I've always gone by the idea that it is never a good idea to store primary keys in places that don't have a constraint. For example, storing a primary key in an EAV-style architecture ("USER_ID", 144). If that user is ever deleted, it will not be reflected in the EAV mapping and will cause issues further down the road.
So I'm creating a new application using Shiro as a security/permission framework, and I need users to be able to edit themselves but not other users; I also need certain users to be able to edit anyone. Simple enough:
user1 = "user:441:edit"
user2 = "user:edit"
In addition, I could have someone who can only edit a subset of users, something like this:
user3 = "user:459:edit","user:460:edit","user:461:edit"
Or someone who can edit users in a department, but only that department:
user4 = "department:5898:user:edit"
If someone from user3's list is deleted, there's no way to update that user's permissions without magic (going through all the permissions and finding the ones to remove).
Now, I don't plan to reset the keys, but if that were ever to happen and I DIDN'T clean up the old keys, I could have users suddenly being able to edit users who were created after the old keys were reused.
I could mitigate some of this by using a generated code ("user:ciS84nFSHK:edit") that is unique across ALL TABLES to manage deletion of permissions. However, adding in a few hundred million records makes me think this can grow unwieldy quickly.
Am I using Shiro improperly? Am I simply overfocused on keys getting mangled? Have you solved these issues? Any help would be appreciated.
After doing some more research with Shiro, I've decided to go with pst's suggestion and simply reference the values via a GUID (maybe a 10-character alphanumeric code instead, to save space).
So user1 = "user:441:edit" would become "user:96aae854-fc40-42ff-b948-c8944c2fca92:edit"; this will help negate the concern about storing keys as strings.
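On the database side this just means giving each user a GUID alongside the numeric primary key and building the permission strings from it, never from the auto-increment id (a sketch; table and column names are mine):

ALTER TABLE user ADD COLUMN guid CHAR(36) NULL;
UPDATE user SET guid = UUID() WHERE guid IS NULL;  -- backfill existing rows
ALTER TABLE user MODIFY guid CHAR(36) NOT NULL,
                 ADD UNIQUE KEY uq_user_guid (guid);

-- MySQL can generate the value at insert time
INSERT INTO user (name, guid) VALUES (?, UUID());

-- Build the Shiro permission string from the GUID
SELECT CONCAT('user:', guid, ':edit') FROM user WHERE id = ?;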

Database schema for ACL

I want to create a schema for an ACL; however, I'm torn between a couple of ways of implementing it.
I am pretty sure I don't want to deal with cascading permissions as that leads to a lot of confusion on the backend and for site administrators.
I think I can also live with users only being in one role at a time. A setup like this will allow roles and permissions to be added as needed as the site grows without affecting existing roles/rules.
At first I was going to normalize the data and have three tables to represent the relations.
ROLES { id, name }
RESOURCES { id, name }
PERMISSIONS { id, role_id, resource_id }
A query to figure out whether a user was allowed somewhere would look like this:
SELECT id FROM resources WHERE name = ?
SELECT * FROM permissions WHERE role_id = ? AND resource_id = ? ($user_role_id, $resource->id)
Then I realized that I will only have about 20 resources, each with up to 5 actions (create, update, view, etc..) and perhaps another 8 roles. This means that I can exercise blatant disregard for data normalization as I will never have more than a couple of hundred possible records.
So perhaps a schema like this would make more sense.
ROLES { id, name }
PERMISSIONS { id, role_id, resource_name }
which would allow me to look up records in a single query:
SELECT * FROM permissions WHERE role_id = ? AND resource_name = ? ($user_role_id, 'post.update')
So which of these is more correct? Are there other schema layouts for ACL?
In my experience, the real question mostly breaks down to whether or not any amount of user-specific access-restriction is going to occur.
Suppose, for instance, that you're designing the schema of a community and that you allow users to toggle the visibility of their profile.
One option is to use a public/private profile flag and stick to broad, pre-emptive permission checks: 'users.view' (views public users) vs, say, 'users.view_all' (views all users, for moderators).
Another involves more refined permissions: you might want users to be able to configure things so they can make themselves (a) viewable by all, (b) viewable by their hand-picked buddies, (c) kept private entirely, and perhaps (d) viewable by all except their hand-picked bozos. In this case you need to store owner/access-related data for individual rows, and you'll need to heavily abstract some of these things in order to avoid materializing the transitive closure of a dense, oriented graph.
With either approach, I've found that the added complexity in role editing/assignment is offset by the resulting ease/flexibility in assigning permissions to individual pieces of data, and that the following worked best:
Users can have multiple roles
Roles and permissions merged in the same table with a flag to distinguish the two (useful when editing roles/perms)
Roles can assign other roles, and roles and perms can assign permissions (but permissions cannot assign roles), from within the same table.
The resulting oriented graph can then be pulled in two queries, built once and for all in a reasonable amount of time using whichever language you're using, and cached into Memcache or similar for subsequent use.
From there, pulling a user's permissions is a matter of checking which roles they have and processing them using the permission graph to get the final permissions. Check a permission by verifying whether the user has the specified role/permission, then run your query or issue an error based on that check.
You can extend the check for individual nodes (i.e. check_perms($user, 'users.edit', $node) for "can edit this node" vs check_perms($user, 'users.edit') for "may edit a node") if you need to, and you'll have something very flexible/easy to use for end users.
As the opening example should illustrate, be wary of steering too much towards row-level permissions. The performance bottleneck is less in checking an individual node's permissions than it is in pulling a list of valid nodes (i.e. only those that the user can view or edit). I'd advise against anything beyond flags and user_id fields within the rows themselves if you're not (very) well versed in query optimization.
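For what it's worth, a sketch of the merged table and its self-assignment edges described above (all names here are mine, not prescribed):

CREATE TABLE acl_entries (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    is_role TINYINT(1) NOT NULL DEFAULT 0  -- the flag distinguishing roles from permissions
) ENGINE=InnoDB;

CREATE TABLE acl_assignments (
    assigner_id INT UNSIGNED NOT NULL,  -- the role (or permission) granting...
    assigned_id INT UNSIGNED NOT NULL,  -- ...this role or permission
    PRIMARY KEY (assigner_id, assigned_id),
    FOREIGN KEY (assigner_id) REFERENCES acl_entries (id),
    FOREIGN KEY (assigned_id) REFERENCES acl_entries (id)
) ENGINE=InnoDB;

CREATE TABLE user_roles (
    user_id INT UNSIGNED NOT NULL,
    entry_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (user_id, entry_id),
    FOREIGN KEY (entry_id) REFERENCES acl_entries (id)
) ENGINE=InnoDB;

Note that the "permissions cannot assign roles" rule can't be expressed as a plain foreign key; it would have to be enforced in application code or a trigger.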
This means that I can exercise blatant disregard for data normalization as I will never have more than a couple of hundred possible records.
The number of rows you expect isn't a criterion for choosing which normal form to aim for.
Normalization is concerned with data integrity. It generally increases data integrity by reducing redundancy.
The real question to ask isn't "How many rows will I have?", but "How important is it for the database to always give me the right answers?" For a database that will be used to implement an ACL, I'd say "Pretty danged important."
If anything, a low number of rows suggests you don't need to be concerned with performance, so 5NF should be an easy choice to make. You'll want to hit 5NF before you add any id numbers.
A query to figure out if a user was allowed somewhere would look like this:
SELECT id FROM resources WHERE name = ?
SELECT * FROM permissions WHERE role_id = ? AND resource_id = ? ($user_role_id, $resource->id)
That you wrote that as two queries instead of using an inner join suggests that you might be in over your head. (That's an observation, not a criticism.)
SELECT p.*
FROM permissions p
INNER JOIN resources r ON (r.id = p.resource_id)
WHERE r.name = ? AND p.role_id = ?
You can use a SET column to store the allowed actions per resource:
CREATE TABLE permission (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    perm SET('create', 'edit', 'delete', 'view'),
    resource_id INT UNSIGNED NOT NULL
);

Versioned and indexed data store

I have a requirement to store all versions of an entity in an easily indexed way, and I was wondering if anyone has input on what system to use.
Without versioning, the system is simply a relational database with a row per, for example, person. If the person's state changes, that row is changed to reflect this. With versioning, the entry should be updated in such a way that we can always go back to a previous version. If I could use a temporal database this would come for free, and I would be able to ask 'what is the state of all people as of yesterday at 2pm, living in Dublin and aged 30'. Unfortunately, there don't seem to be any mature open-source projects that can do temporal queries.
A really nasty way to do this is just to insert a new row per state change. This leads to duplication, as a person can have many fields while only one changes per update. It is also quite slow to select the correct version for every person given a timestamp.
In theory it should be possible to use a relational database and a version control system to mimic a temporal database but this sounds pretty horrendous.
So I was wondering if anyone has come across something similar before and how they approached it?
Update
As suggested by Aaron, here's the query we currently use (in MySQL). It's definitely slow on our table with >200k rows. (id = table key, person_id = id per person, duplicated if the person has many revisions.)
select name
from person p
where p.id = (select max(id)
              from person
              where person_id = p.person_id
                and timestamp <= :timestamp)
Update
It looks like the best way to do this is with a temporal DB, but given that there aren't any open-source ones out there, the next best method is to store a new row per update. The only problems are duplication of unchanged columns and a slow query.
There are two ways to tackle this. Both assume that you always insert new rows. In every case, you must insert a timestamp (created) which tells you when a row was "modified".
The first approach uses a number to count how many instances you already have. The primary key is the object key plus the version number. The problem with this approach seems to be that you'll need a select max(version) to make a modification. In practice, this is rarely an issue since for all updates from the app, you must first load the current version of the person, modify it (and increment the version) and then insert the new row. So the real problem is that this design makes it hard to run updates in the database (for example, assign a property to many users).
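A sketch of that first approach and the "as of" query it supports (the schema is illustrative, not prescriptive):

CREATE TABLE person (
    person_id INT UNSIGNED NOT NULL,  -- stable key identifying the person
    version INT UNSIGNED NOT NULL,
    name VARCHAR(100) NOT NULL,
    created DATETIME NOT NULL,        -- when this version was written
    PRIMARY KEY (person_id, version)
) ENGINE=InnoDB;

-- State of everyone as of :timestamp
SELECT p.*
FROM person p
WHERE p.version = (SELECT MAX(version)
                   FROM person
                   WHERE person_id = p.person_id
                     AND created <= :timestamp);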
The next approach uses links in the database. Instead of a composite key, you give each object a new key and you have a replacedBy field which contains the key of the next version. This approach makes it simple to find the current version (... where replacedBy is NULL). Updates are a problem, though, since you must insert a new row and update an existing one.
To solve this, you can add a back pointer (previousVersion). This way, you can insert the new rows and then use the back pointer to update the previous version.
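The linked-list variant might look like this (replacedBy/previousVersion as described; the rest is assumed):

CREATE TABLE person (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    person_id INT UNSIGNED NOT NULL,  -- stable key identifying the person
    name VARCHAR(100) NOT NULL,
    replacedBy INT UNSIGNED NULL,     -- key of the next version; NULL = current
    previousVersion INT UNSIGNED NULL -- back pointer to the row this one replaces
) ENGINE=InnoDB;

-- Current version of every person
SELECT * FROM person WHERE replacedBy IS NULL;

-- New version: insert the row, then follow the back pointer to patch the old one
-- (wrap both statements in a transaction)
INSERT INTO person (person_id, name, previousVersion) VALUES (?, ?, :old_id);
UPDATE person SET replacedBy = LAST_INSERT_ID() WHERE id = :old_id;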
Here is a (somewhat dated) survey of the literature on temporal databases: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.6988&rep=rep1&type=pdf
I would recommend spending a good while sitting down with those references and/or Google Scholar to try to find some good techniques that fit your data model. Good luck!