I am using MySQL 5 and Ruby on Rails 2.3.1. I have some validations that occasionally fail to prevent duplicate items from being saved to my database. Is it possible, at the database level, to prevent a duplicate entry from being created based on certain parameters?
I am saving emails to a database, and don't want to save a duplicate subject line, body, and sender address. Does anyone know how to impose such a limit on a DB through a migration?
You have a number of options to ensure a unique value set is inserted into your table. Let's consider two: 1) push the responsibility to the database engine, or 2) make it your application's responsibility.
Pushing the responsibility to the database engine could entail creating a UNIQUE index on your table; see the MySQL CREATE INDEX syntax. Note that this solution will result in an exception being thrown if a duplicate value is inserted. As you've identified what I infer to be three columns that determine uniqueness (subject line, body, and sender address), you'll create the index over all three columns. It's been a while since I've worked with Rails, so you may want to check the number of records inserted as well.
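As a rough illustration only (the emails table and its subject, body, and sender_address columns are assumed names; in a Rails migration you could wrap raw SQL like this in execute, or use add_index with the :unique option):

```sql
-- Hypothetical table and column names; adjust to your schema.
-- If body is a TEXT column, MySQL requires a prefix length on it because
-- index keys have a size limit, hence body(255) below. That also means
-- uniqueness is only enforced on the first 255 characters of the body;
-- a separate indexed hash column is one way around that.
ALTER TABLE emails
  ADD UNIQUE INDEX index_emails_on_subject_body_sender (subject, body(255), sender_address);
```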
If you'd rather push this responsibility onto your application software, you'll need to contend with potential data-insertion conflicts. That is, assume you have two users creating an email simultaneously (just work with me here) with the same subject line, body, and sender address. Should your code simply query for any existing records matching that text (identical for both users in this example), both queries will return no records found, and both will merrily proceed to insert their emails, which now violates your premise. You can address this with a table lock, or with some other synchronization mechanism in the database to ensure duplicates don't appear. The latter approach could consist of another table with a single field indicating whether someone is currently inserting a record; once the insert completes, that record is updated and others can proceed. A rough sketch of one such approach follows.
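For example, a minimal sketch using MySQL's named locks to serialize the check-then-insert (the lock name, table, and column names are all made up for illustration):

```sql
-- Serialize the check-then-insert so two clients can't both pass the
-- "does it already exist?" test. All names here are illustrative.
SELECT GET_LOCK('email_insert', 10);   -- wait up to 10 seconds for the lock

INSERT INTO emails (subject, body, sender_address)
SELECT 'Hello', 'Message text', 'someone@example.com'
FROM DUAL
WHERE NOT EXISTS (
  SELECT 1 FROM emails
  WHERE subject = 'Hello'
    AND body = 'Message text'
    AND sender_address = 'someone@example.com'
);

SELECT RELEASE_LOCK('email_insert');
```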
While there's a separate architectural discussion to be had about the implications of each alternative, I'll leave that to a separate post. Hopefully this suffices in answering your question.
You should be able to add a unique index to any columns you want to be unique throughout the table.
I am setting up a database using MySQL with the goal of storing data for a potentially large number of unique users. Each user will have some basic data associated with them - a unique username, when they joined the service, how many times they have used the service, in addition to a set of their personal preferences. I am planning on keeping one table called 'Users' dedicated to these fields.
However, there is a bunch of data with a specific schema that will be collected about that user during each session that they use the service. This data includes which user performed this session, the date of the session, what the user did, etc.
My thought process is the following: if I use a single table for users that also includes data on each of their sessions, this seems inefficient, because there would have to be either a column for each unique session, or a column containing more or less an array or list of sessions. If I wanted to keep this data for an indeterminate number of sessions, the one-column-per-session idea would break down, because I believe there is a limit on the number of columns. Updating an array within a single column also seems to be frowned upon, I think for reasons having to do with preserving the integrity of the data and maintaining the best possible organization.
So it seems like I want two tables, one for users, and another for sessions. Every time anybody completes a session, data about that session will be created as a new row in the 'Sessions' table, and each row would also have a foreign key linking that session to the particular user who completed it.
Is this a correct line of thought? If not, how should I think about this?
Thanks
I would say you're pretty close. You should separate users and sessions, and you're looking at modeling a relationship. Each session only has one user, so it's a one-to-many relationship.
1 user (1 row in the "Users" table) can have many sessions (each one its own row in the "Sessions" table)
The foreign key is the user ID in the Sessions table. This links each unique session (which will have its own session ID, I'm assuming) back to a unique user in the Users table.
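A minimal sketch of that layout (column names and types are just placeholders, not anything your schema has to match):

```sql
-- Illustrative schema only; column names and types are assumptions.
CREATE TABLE Users (
  user_id     INT UNSIGNED NOT NULL AUTO_INCREMENT,
  username    VARCHAR(50)  NOT NULL,
  joined_at   DATETIME     NOT NULL,
  usage_count INT UNSIGNED NOT NULL DEFAULT 0,
  PRIMARY KEY (user_id),
  UNIQUE KEY uq_users_username (username)
) ENGINE=InnoDB;

CREATE TABLE Sessions (
  session_id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
  user_id      INT UNSIGNED NOT NULL,          -- foreign key back to Users
  session_date DATETIME     NOT NULL,
  activity     VARCHAR(255) NULL,
  PRIMARY KEY (session_id),
  KEY idx_sessions_user_id (user_id),
  CONSTRAINT fk_sessions_user
    FOREIGN KEY (user_id) REFERENCES Users (user_id)
) ENGINE=InnoDB;
```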
If you're looking at a massive volume of users, which means a ton of sessions, you may want to consider options for keeping the Sessions table from growing extremely large and slow to query. If you're collecting this data on a daily basis, consider that you could "partition" the table on dates:
Partitioning on DateTime in MySQL
Every implementation of a credentials table I've seen has an auto-incrementing id to track users.
However,
If I verify unique email addresses before inserting into a MySQL table, then I can guarantee the uniqueness of each row by email address. Furthermore, I can access the table as needed through the email address.
Does anyone see a problem with this?
I'm trying to understand why others don't follow this approach.
Those email addresses are much larger than 4 bytes, and, perhaps even worse for the storage engine, they are variable length.
Also one person might want two accounts, or might have several email addresses over time.
Then there are the problems associated with case folding.
When other tables have data that relates to users, what do you use as a foreign key? Their email address? What if they want to change their email address? What would have been a single one-row update now becomes a giant mess.
A generated key allows you to decouple data that can change from the relationships between records and tables.
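One common compromise (sketched here with made-up names) is to keep the generated key for relationships while still enforcing the uniqueness of the email with a unique index:

```sql
-- Surrogate key for relationships, unique index for the business rule.
-- Table and column names are illustrative.
CREATE TABLE users (
  user_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  email   VARCHAR(255) NOT NULL,
  PRIMARY KEY (user_id),
  UNIQUE KEY uq_users_email (email)
) ENGINE=InnoDB;

-- Other tables reference the fixed-width user_id, so changing an email
-- address stays a single one-row update:
UPDATE users SET email = 'new@example.com' WHERE user_id = 42;
```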
I am considering designing a relational DB schema for a DB that never actually deletes anything (sets a deleted flag or something).
1) What metadata columns are typically used to accommodate such an architecture? Obviously a boolean IsDeleted flag can be set, or maybe just a timestamp in a Deleted column works better, or possibly both. I'm not sure which method will cause me more problems in the long run.
2) How are updates typically handled in such architectures? If you mark the old value as deleted and insert a new one, you will run into PK unique constraint issues (e.g. if you have PK column id, then the new row must have the same id as the one you just marked as invalid, or else all of your foreign keys in other tables for that id will be rendered useless).
If your goal is auditing, I'd create a shadow table for each table you have. Add some triggers that get fired on update and delete and insert a copy of the row into the shadow table.
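A minimal sketch of that idea, with an invented orders table standing in for one of yours:

```sql
-- Shadow table: a copy of the original columns with the constraints
-- turned off, plus bookkeeping columns. All names here are invented.
CREATE TABLE orders_shadow (
  shadow_id  INT UNSIGNED NOT NULL AUTO_INCREMENT,
  action     ENUM('UPDATE', 'DELETE') NOT NULL,
  changed_at DATETIME NOT NULL,
  changed_by VARCHAR(100),               -- USER() returns 'user@host'
  order_id   INT UNSIGNED NOT NULL,
  amount     DECIMAL(10,2),
  PRIMARY KEY (shadow_id)
) ENGINE=InnoDB;

DELIMITER //

CREATE TRIGGER orders_audit_update
BEFORE UPDATE ON orders
FOR EACH ROW
BEGIN
  INSERT INTO orders_shadow (action, changed_at, changed_by, order_id, amount)
  VALUES ('UPDATE', NOW(), USER(), OLD.order_id, OLD.amount);
END//

CREATE TRIGGER orders_audit_delete
BEFORE DELETE ON orders
FOR EACH ROW
BEGIN
  INSERT INTO orders_shadow (action, changed_at, changed_by, order_id, amount)
  VALUES ('DELETE', NOW(), USER(), OLD.order_id, OLD.amount);
END//

DELIMITER ;
```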
Here are some additional questions that you'll also want to consider
How often do deletes occur? What's your performance budget like? This can affect your choices: the answer will be different depending on whether a user deletes a single row (say, an answer on a Q&A site) or records are deleted from a feed on an hourly basis.
How are you going to expose the deleted records in your system? Are they visible only for administrative purposes, or can any user see deleted records? This makes a difference because you'll probably need to come up with a filtering mechanism that depends on the user.
How will foreign key constraints work? Can one table reference a row in another table that has been marked deleted?
When you add or alter existing tables what happens to the deleted records?
Typically, systems that care a lot about auditing use shadow tables as Steve Prentice mentioned. The shadow table often has every field from the original table with all the constraints turned off. It will often have an action field to track updates vs. deletes, and include a date/timestamp of the change along with the user who made it.
For an example see the PostHistory Table at https://data.stackexchange.com/stackoverflow/query/new
I think what you're looking for here is typically referred to as "knowledge dating".
In this case, your primary key would be your regular key plus the knowledge start date.
Your end date might either be null for a current record or an "end of time" sentinel.
On an update, you'd typically set the end date of the current record to "now" and insert a new record that starts at the same "now" with the new values.
On a "delete", you'd just set the end date to "now".
I've done that.
2.a) A version number solves the unique-constraint issue somewhat, although that's really just relaxing the uniqueness, isn't it?
2.b) You can also archive the old versions into another table.
I'm no database guru, so I'm curious if a table lock is necessary in the following circumstance:
We have a web app that lets users add entries to the database via an HTML form
Each entry a user adds must have a unique URL
The URL should be generated on the fly, by pulling the most recent ID from the database, adding one, and appending it to the newly created entry
The app is running on ExpressionEngine (I only mention this in case it makes my situation easier to understand for those familiar with the EE platform)
Relevant DB Columns
(exp_channel_titles)
entry_id (primary key, auto_increment)
url_title (must be unique)
My Hypothetical Solution -- is table locking required here?
Let's say there are 100 entries in the table, and each entry in the table has a url_title like entry_1, entry_2, entry_3, etc., all the way to entry_100. Each time a user adds an entry, my script would do something like this:
Query (SELECT) the table to determine the last entry_id and assign it to the variable $last_id
Add 1 to the returned value, and assign the sum to the variable $new_id
INSERT the new entry, setting the url_title field of the latest entry to entry_$new_id (the 101st entry in the table would thus have a url_title of entry_101)
Since my database knowledge is limited, I don't know if I need to worry about locking here. What if a thousand people try to add entries to the database within a 10 second period? Does MySQL automatically handle this, or do I need to lock the table while each new entry is added, to ensure each entry has the correct id?
Running on the MyISAM engine, if that makes a difference.
I think you should look at one of two approaches:
Use an AUTO_INCREMENT column to assign the id
Switch from MyISAM to the InnoDB storage engine, which is fully transactional, and wrap your queries in a transaction
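For the first option, here's a hedged sketch of how the assigned id could drive url_title without a separate SELECT. Other required exp_channel_titles columns are omitted, the temporary UUID() value is just a trick to satisfy the uniqueness of url_title until the real value is set, and LAST_INSERT_ID() is per-connection, so concurrent inserts don't race on it:

```sql
-- Insert first (other required columns omitted); UUID() keeps url_title
-- unique until we overwrite it with the real value.
INSERT INTO exp_channel_titles (url_title) VALUES (UUID());

-- Derive url_title from the id MySQL just assigned to this connection.
UPDATE exp_channel_titles
   SET url_title = CONCAT('entry_', LAST_INSERT_ID())
 WHERE entry_id = LAST_INSERT_ID();
```

If you also switch to InnoDB, the INSERT and UPDATE can be wrapped in a single transaction so a failure between them doesn't leave a half-finished row behind.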
I am working with a database where "almost" every table in the database has the same field and same value. For example, almost all tables have a field called GroupId and there is only one group id in the database now.
Benefits
All data is related to that field and can be identified by said field
When a new group is created data will be properly identified for the group
Disadvantages
All tables have this field
All stored procedures need to have this field as a parameter
All queries have to be filtered by this field
Is this a big deal? Is there an alternative to this approach?
Thanks
If you need to be able to identify data by more than one group in the future, having the foreign keys in place is good practice. However, that doesn't mean all tables need this field, only the ones directly related to the group. For instance, a lookup table with state values may not need it, but the customers table might. Adding it to all tables willy-nilly can lead to bad things when you try to delete a record and have to check 579 tables (only 25 of which are pertinent). All this depends greatly on what the groups mean. Most of our tables have a relationship to the client table, because they contain data related to specific clients and because we don't want various clients to be able to see each other's data. Tables which do not contain that kind of data do not.
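To make that concrete, a sketch with invented names: the customer data carries the group key, the lookup table doesn't.

```sql
-- All names here are illustrative.
CREATE TABLE groups (
  group_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  name     VARCHAR(100) NOT NULL,
  PRIMARY KEY (group_id)
) ENGINE=InnoDB;

-- Directly related to a group, so it carries the foreign key.
CREATE TABLE customers (
  customer_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  group_id    INT UNSIGNED NOT NULL,
  name        VARCHAR(100) NOT NULL,
  PRIMARY KEY (customer_id),
  KEY idx_customers_group_id (group_id),
  CONSTRAINT fk_customers_group
    FOREIGN KEY (group_id) REFERENCES groups (group_id)
) ENGINE=InnoDB;

-- A pure lookup table has no relationship to a group, so no group_id.
CREATE TABLE states (
  state_code CHAR(2)     NOT NULL,
  state_name VARCHAR(50) NOT NULL,
  PRIMARY KEY (state_code)
) ENGINE=InnoDB;
```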
Yes, most queries may need the field, and many stored procs will want to have it as an input variable, but if you truly need to filter on this information, then that is as it should be.
If however there is only one group and will never be more than one group, it is a waste of time, effort and space.