How can one prevent MySQL insertion within a particular timestamp interval?

I have a system whereby users can input data into a mysql table from many sites across the globe.
The data is posted via ajax to my table without issues. But, I would like to improve my insertion code to prevent insertion if the timestamp is within some interval. This would weed out duplicate rows in my table.
Before you get angry -> I do understand I can set a primary key to certain columns and prevent duplicate insertion.
In my use case, I need to allow duplications of the numeric data where it is truly duplicated values from a unique submission -> this is valid in my case. I would like to leverage the timestamp to weed out obvious double insertions where the variables were submitted by accident twice.
I have tried to disable the button for 1-2 seconds, but this hasn't solved the problem entirely.
If I have columns: weight, height, country and the timestamp, I'd like to somehow check if there is an insert within n seconds of the timestamp, where the post includes data that matches these variables. This would tell me that there is an accidental duplication from a user and I shouldn't insert it into the database.
I'm not too familiar with MySQL, so I was hoping to get some guidance here.
Thanks.

There are different solutions, depending on the specifics of your case:
If you need to apply a rule that validates the new row using only values inside the row itself, a CHECK constraint will do. Note, though, that MySQL only enforces CHECK constraints starting in version 8.0.16; earlier versions parse them but ignore them.
If you want to enforce a rule in relation to other rows, you can serialize the insertions into a queue. The consumer of the queue will validate the insertions one by one and will accept or reject them. Note that serialization is not a good option for a massive volume of insertions, since it produces a bottleneck (this may be your case, since you mention insertions from across the globe).
Alternatively, you can use optimistic insertion: always perform the insertion with an intermediate status such as "waiting for validation". Other process(es) can then validate the row. If all is good, the row is approved; if not, a compensation procedure is executed, à la microservices.
Which one is your case?
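For the check described in the question (same weight, height and country within n seconds), a conditional insert is one more option. The sketch below is only an illustration: it uses Python's sqlite3 module as a stand-in for MySQL so that it runs on its own, and the table name measurements, the column created_at (a Unix timestamp) and the function insert_unless_recent are all made up for the example. In MySQL the same idea is usually written as INSERT ... SELECT ... FROM DUAL WHERE NOT EXISTS (...).

```python
import sqlite3

# Hypothetical schema; SQLite stands in for MySQL so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements ("
    " id INTEGER PRIMARY KEY,"
    " weight REAL, height REAL, country TEXT,"
    " created_at INTEGER)"  # Unix timestamp in seconds
)


def insert_unless_recent(conn, weight, height, country, now, window=5):
    """Insert the row unless an identical one arrived within `window` seconds.

    Returns True if a row was inserted, False if it was treated as a duplicate.
    """
    cur = conn.execute(
        "INSERT INTO measurements (weight, height, country, created_at) "
        "SELECT ?, ?, ?, ? "
        "WHERE NOT EXISTS ("
        "  SELECT 1 FROM measurements"
        "  WHERE weight = ? AND height = ? AND country = ?"
        "    AND created_at > ? - ?)",
        (weight, height, country, now, weight, height, country, now, window),
    )
    conn.commit()
    # rowcount is 1 when the row went in and 0 when the NOT EXISTS check failed.
    return cur.rowcount == 1
```

Calling insert_unless_recent twice with the same values two seconds apart should store only the first row. Two requests that arrive at exactly the same moment on different connections could still both pass the check on a real server, so this narrows the problem rather than eliminating it.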

Related

Can/should I make id column that is part of a composite key non-unique [duplicate]

I have got a table which has an id (primary key with auto increment), uid (key referring to a user's id, for example) and something else which for my question won't matter.
I want to make, let's call it, different auto-increment keys on id for each uid entry.
So, I will add an entry with uid 10, and the id field for this entry will be 1 because there were no previous entries with uid 10. I will add a new one with uid 4 and its id will be 3 because there were already two entries with uid 4.
...Very obvious explanation, but I am trying to be as explanatory and clear as I can to demonstrate the idea... clearly.
What SQL engine can provide such a functionality natively? (non Microsoft/Oracle based)
If there is none, how could I best replicate it? Triggers perhaps?
Does this functionality have a more suitable name?
In case you know about a non-SQL database engine providing such functionality, name it anyway, I am curious.
Thanks.
MySQL's MyISAM engine can do this. See their manual, in section Using AUTO_INCREMENT:
For MyISAM tables you can specify AUTO_INCREMENT on a secondary column in a multiple-column index. In this case, the generated value for the AUTO_INCREMENT column is calculated as MAX(auto_increment_column) + 1 WHERE prefix=given-prefix. This is useful when you want to put data into ordered groups.
The docs go on after that paragraph, showing an example.
The InnoDB engine in MySQL does not support this feature, which is unfortunate because it's better to use InnoDB in almost all cases.
You can't emulate this behavior using triggers (or any SQL statements limited to transaction scope) without locking tables on INSERT. Consider this sequence of actions:
Mario starts transaction and inserts a new row for user 4.
Bill starts transaction and inserts a new row for user 4.
Mario's session fires a trigger to compute MAX(id)+1 for user 4. It gets 3.
Bill's session fires a trigger to compute MAX(id)+1 for user 4. It also gets 3.
Bill's session finishes his INSERT and commits.
Mario's session tries to finish his INSERT, but the row with (userid=4, id=3) now exists, so Mario gets a primary key conflict.
In general, you can't control the order of execution of these steps without some kind of synchronization.
The solutions to this are either:
Get an exclusive table lock. Before trying an INSERT, lock the table. This is necessary to prevent concurrent INSERTs from creating a race condition like in the example above. It's necessary to lock the whole table because you're trying to restrict INSERT, so there's no specific row to lock (if you were trying to govern access to a given row with UPDATE, you could lock just the specific row). But locking the table causes access to the table to become serial, which limits your throughput.
Do it outside transaction scope. Generate the id number in a way that won't be hidden from two concurrent transactions. By the way, this is what AUTO_INCREMENT does. Two concurrent sessions will each get a unique id value, regardless of their order of execution or order of commit. But tracking the last generated id per userid requires access to the database, or a duplicate data store. For example, a memcached key per userid, which can be incremented atomically.
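As a rough illustration of the second option, here is a per-key counter whose increment is atomic. This is only a sketch: the class PerKeyCounter and the function allocate_many are invented for the example, and an in-process lock stands in for whatever shared store (such as a memcached increment) you would really use across several application servers.

```python
import threading
from collections import defaultdict


class PerKeyCounter:
    """In-process stand-in for an atomic per-key increment."""

    def __init__(self):
        self._lock = threading.Lock()
        self._last = defaultdict(int)

    def next_id(self, key):
        # The lock makes read-increment-write a single atomic step,
        # so two callers can never be handed the same value.
        with self._lock:
            self._last[key] += 1
            return self._last[key]


counter = PerKeyCounter()


def allocate_many(key, n, out):
    """Allocate n ids for `key` and append them to the shared list `out`."""
    for _ in range(n):
        out.append(counter.next_id(key))


# Eight threads all allocate ids for uid 4 at the same time.
results = []
threads = [
    threading.Thread(target=allocate_many, args=(4, 100, results))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every value handed out for uid 4 is distinct, regardless of how the threads interleave. Note that the counter knows nothing about rollbacks or failed inserts, so gaps remain possible.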
It's relatively easy to ensure that inserts get unique values. But it's hard to ensure they will get consecutive ordinal values. Also consider:
What happens if you INSERT in a transaction but then roll back? You've allocated id value 3 in that transaction, and then I allocated value 4, so if you roll back and I commit, now there's a gap.
What happens if an INSERT fails because of other constraints on the table (e.g. another column is NOT NULL)? You could get gaps this way too.
If you ever DELETE a row, do you need to renumber all the following rows for the same userid? What does that do to your memcached entries if you use that solution?
SQL Server should allow you to do this. If you can't implement this using a computed column (probably not - there are some restrictions), surely you can implement it in a trigger.
MySQL also would allow you to implement this via triggers.
In a comment you ask the question about efficiency. Unless you are dealing with extreme volumes, storing an 8 byte DATETIME isn't much of an overhead compared to using, for example, a 4 byte INT.
It also massively simplifies your data inserts, as well as being able to cope with records being deleted without creating 'holes' in your sequence.
If you DO need this, be careful with the field names. If you have uid and id in a table, I'd expect id to be unique in that table, and uid to refer to something else. Perhaps, instead, use the field names property_id and amendment_id.
In terms of implementation, there are generally two options.
1). A trigger
Implementations vary, but the logic remains the same. As you don't specify an RDBMS (other than NOT MS/Oracle) the general logic is simple...
Start a transaction (often this is implicitly already started inside triggers)
Find the MAX(amendment_id) for the property_id being inserted
Update the newly inserted value with MAX(amendment_id) + 1
Commit the transaction
Things to be aware of are...
- multiple records being inserted at the same time
- records being inserted with amendment_id being already populated
- updates altering existing records
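The trigger logic above could look roughly like this. Treat it as a sketch under assumptions: it runs against SQLite through Python's sqlite3 module rather than MySQL, the table amendments and its columns row_id and note are invented for the example, and it assumes callers leave amendment_id empty on insert. SQLite allows only one writer at a time, so the concurrency problems listed above do not show up here; on a multi-writer server you would still need locking.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE amendments (
    row_id       INTEGER PRIMARY KEY,
    property_id  INTEGER NOT NULL,
    amendment_id INTEGER,          -- filled in by the trigger
    note         TEXT
);

-- After each insert, number the new row within its property_id group.
-- MAX() ignores the new row's NULL, and COALESCE covers the first row.
CREATE TRIGGER number_amendment AFTER INSERT ON amendments
BEGIN
    UPDATE amendments
       SET amendment_id = (
               SELECT COALESCE(MAX(amendment_id), 0) + 1
                 FROM amendments
                WHERE property_id = NEW.property_id
           )
     WHERE row_id = NEW.row_id;
END;
""")

conn.execute("INSERT INTO amendments (property_id, note) VALUES (4, 'first')")
conn.execute("INSERT INTO amendments (property_id, note) VALUES (4, 'second')")
conn.execute("INSERT INTO amendments (property_id, note) VALUES (10, 'other')")
conn.commit()
```

The two rows for property 4 are numbered 1 and 2, and the row for property 10 starts again at 1.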
2). A Stored Procedure
If you use a stored procedure to control writes to the table, you gain a lot more control.
Implicitly, you know you're only dealing with one record.
You simply don't provide a parameter for DEFAULT fields.
You know what updates / deletes can and can't happen.
You can implement all the business logic you like without hidden triggers
I personally recommend the Stored Procedure route, but triggers do work.
It is important to get your data types right.
What you are describing is a multi-part key. So use a multi-part key. Don't try to encode everything into a magic integer, you will poison the rest of your code.
If a record is identified by (entity_id,version_number) then embrace that description and use it directly instead of mangling the meaning of your keys. You will have to write queries which constrain the version number but that's OK. Databases are good at this sort of thing.
version_number could be a timestamp, as a_horse_with_no_name suggests. This is quite a good idea. There is no meaningful performance disadvantage to using timestamps instead of plain integers. What you gain is meaning, which is more important.
You could maintain a "latest version" table which contains, for each entity_id, only the record with the most-recent version_number. This will be more work for you, so only do it if you really need the performance.

What is the purpose of re-setting AUTO_INCREMENT in MySQL?

I have encountered the fact that some people, after performing deletion of rows from a table, also reset the AUTO_INCREMENT for the primary key column of that table to re-number all the values as if they started from 1 again (or whatever the initial starting point).
My question is, is there a specific reason for doing this, other than just preference? As in, is there any detrimental impact on the database or future queries if you do not reset the auto-increment and just leave it as-is? If there is, could somebody provide an example where it would be necessary to reset AUTO_INCREMENT?
Thanks!
I don't think it is ever necessary to reset auto_increment, unless you are running out of values.
One case where auto-increment is often reset is when all the rows are deleted. If you use truncate table, then the auto-increment value is reset automatically. This does not always happen with delete without a where clause, so for consistency, you might want to reset it.
Another case is when a large insert fails, particularly if it fails repeatedly. You might not want the really large gaps.
When moving tables around you might want to keep the original id values. So, essentially, you ignore the auto-increment on inserts. Afterwards, though, you might want to set the automatic value to be consistent with other systems.
In general, though, resetting the auto-increment is not recommended.
Unfortunately, I've seen this behavior. And from what I observed, it's not due to a technical reason - it's closer to OCD.
Some people really don't like gaps in the ID column - they like the idea of it smoothly increasing by 1 for each record. The idea that some manual data manipulation they're doing might screw that up isn't pleasant, so they jump through hoops to make sure they don't cause gaps in the numbers.
But, yeah, this is a terrible practice. It's just asking for data integrity problems.
Resetting auto-inc is an uncommon operation. Under normal day to day work, just let it keep incrementing.
I've done reset of auto-inc in MySQL instances used for automated testing. A given set of tables is loaded with data over and over, and deletes its test data afterwards. Resetting the auto-inc may be the best way to make tests repeatable, if they're looking for specific values in the results.
Another scenario is when creating archive tables. Suppose you have a huge table, and you want to empty out the data efficiently (not using DELETE), but you want to archive the data, and you want new data to use id values higher than your old data.
-- Create an empty copy with the same structure.
CREATE TABLE mytable_new LIKE mytable;
-- Look up the current counter of the live table in the current schema.
SELECT AUTO_INCREMENT FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = 'mytable';
ALTER TABLE mytable_new AUTO_INCREMENT = /* value + 10000 */;
-- Swap both names in a single atomic statement.
RENAME TABLE mytable TO mytable_archive, mytable_new TO mytable;
The above series of statements allows you to shuffle a new empty table into place atomically, so your app can continue writing to the table by the name it's used to. The auto-inc value you set in the new table should be higher than the max id value in the old table, plus some comfortable gap to avoid overlap during the moments between the statements.
Resetting the auto-increment usually helps in terms of organization: you see no gap between id 6 and 60 if the rows in between have been deleted.
However, you should be careful when resetting auto-increments, because most likely your code will depend on specific ids to fetch certain information.
In my opinion, just truncate the whole thing after your tests and seed the database with the correct information. If it's production, let it run wild and free; resetting could cause harm and brings no benefit.
As per comment on abr's answer, assuming that auto-increment ids are contiguous (or even sequential) is not just a bad idea, it is a dangerous one.
There may be good reason for deliberately creating gaps in the allocated ids if you intend to patch the data at a later point (e.g. if you have restored from an old backup and expect to recover some of the missing data but need to restore a service asap) or when you migrate from a single active server to multiple master nodes. But in these scenarios you are setting the counter to higher value than currently used - not resetting it back to the start.
If there is a risk that you are going to wrap around the numbers, then you've probably picked the wrong data type for your auto-increment attribute - changing the data type is the right way to fix the problem, not deleting data and resetting the counter to 0.

How to INSERT multiple rows when some might be DUPLICATES of an already-existing row?

So I have a checkbox form where users can select multiple values. They can then go back and select different values. Each value is stored as a row (UserID,value).
How do you do that INSERT when some rows might be duplicates of an already-existing row in the table?
Should I first delete the existing values and then INSERT the new values?
ON DUPLICATE KEY UPDATE seems tricky since I would be INSERTing multiple rows at once, so how would I define and separate just the ones that need UPDATING vs. the ones that need INSERTING?
For example, let's say a user makes his first-time selection:
INSERT INTO
Choices(UserID,value)
VALUES
('1','banana'),('1','apple'),('1','orange'),('1','cranberry'),('1','lemon')
What if the user goes back later and makes different choices which include SOME of the values in his original query which will thus cause duplicates?
How should I handle that best?
In my opinion, simply deleting the existing choices and then inserting the new ones is the best way to go. It may not be the most efficient overall, but it is simple to code and thus has a much better chance of being correct.
Otherwise it is necessary to find the intersection of the new choices and old choices. Then either delete the obsolete ones or change them to the new choices (and then insert/delete depending on if the new set of choices is bigger or smaller than the original set). The added risk of the extra complexity does not seem worth it.
Edit: As @Andrew points out in the comments, deleting the originals en masse may not be a good plan if these records happened to be "parent" records in a referential integrity definition. My thinking was that this seemed like an unlikely situation based on the OP's description. But it is definitely worth consideration.
It's not clear to me when you would ever need to update a record in the database in your case.
It sounds like you need to maintain a set of choices per user, which the user may on occasion change. Therefore, each time the user provides a new set of choices, any prior set of choices should be discarded. So you would delete all old records, then insert any new ones.
You might consider carrying out a comparison of the prior and new choices - either in the server or client code - in order to calculate the minimum set of deletes and/or inserts needed to reduce database writes. But that smells like premature optimisation.
Putting all that to one side - if you want a re-insert to be ignored then you should use INSERT IGNORE. Provided there is a unique key on (UserID, value), existing rows will be quietly ignored and new ones will be inserted.
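The delete-then-insert approach is short enough to sketch. The example below is an illustration only: it uses Python's sqlite3 module in place of MySQL so it can run standalone, and the function name replace_choices is made up. The Choices table follows the question, with a primary key on (UserID, value) added as an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Choices ("
    " UserID INTEGER, value TEXT,"
    " PRIMARY KEY (UserID, value))"  # assumed key, not stated in the question
)


def replace_choices(conn, user_id, values):
    """Discard the user's old choices and store the new set atomically."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM Choices WHERE UserID = ?", (user_id,))
        conn.executemany(
            "INSERT INTO Choices (UserID, value) VALUES (?, ?)",
            # set() drops repeats inside the submitted list itself
            [(user_id, v) for v in set(values)],
        )


# First-time selection, then a later change that overlaps with it.
replace_choices(conn, 1, ["banana", "apple", "orange", "cranberry", "lemon"])
replace_choices(conn, 1, ["apple", "lemon", "kiwi"])
```

Because both statements run in one transaction, other readers never see the user with a half-replaced set of choices.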
I don't know much about MySQL, but in MS SQL 2000+ we can execute a stored proc with XML as one of its parameters. This XML would contain a list of identity-value pairs. We would open this XML as a table using openxml and figure out which rows need to be deleted or inserted using a left or right outer join. As of SQL 2008 (I think) we have a new MERGE statement that lets us perform delete, update and insert row operations in one statement on ONE table. This way we can take advantage of set operations in SQL instead of looping through arrays in the application code.
You can also keep your select list retrieved from the database in session and compare the "old list" to the "newly selected list" in your application code. You would need to figure out which rows need to be deleted or added. You probably don't need to worry about updates because you are probably only keeping foreign keys in this table and the descriptions are in some kind of a reference table.
There is another way in SQL 2008 that involves using user defined data-types as custom tables but I don't know much about it.
Personally, I prefer the XML route because you just send the end-state into the sp and your sp automatically figures out which rows need to be deleted or inserted.
Hope this helps.

Indexing only one MySQL column value

I have a MySQL InnoDB table with a status column. The status can be 'done' or 'processing'. As the table grows, at most .1% of the status values will be 'processing,' whereas the other 99.9% of the values will be 'done.' This seems like a great candidate for an index due to the high selectivity for 'processing' (though not for 'done'). Is it possible to create an index for the status column that only indexes the value 'processing'? I do not want the index to waste an enormous amount of space indexing 'done.'
I'm not aware of any standard way to do this but we have solved a similar problem before by using two tables, Processing and Done in your case, the former with an index, the latter without.
Assuming that rows don't ever switch back from done to processing, here are the steps you can use:
When you create a record, insert it into the Processing table with the column set to processing.
When it's finished, set the column to done.
Periodically sweep the Processing table, moving done rows to the Done table.
That last one can be tricky. You can do the insert/delete in a transaction to ensure it transfers properly or you could use a unique ID to detect if it's already transferred and then just delete it from Processing (I have no experience with MySQL transaction support which is why I'm also giving that option).
That way, you're only indexing a few of the 99.9% of done rows, the ones that have yet to be transferred to the Done table. It will also work with multiple states of processing as you have alluded to in comments (entries are only transferred when they hit the done state, all other states stay in the Processing table).
It's akin to having historical data (stuff that will never change again) transferred to a separate table for efficiency. It can complicate some queries where you need access to both done and non-done rows since you have to join two tables so be aware there's a trade-off.
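A minimal sketch of the periodic sweep, done inside one transaction. Assumptions: SQLite via Python's sqlite3 module stands in for MySQL, and the columns id and payload, the index name and the function sweep_done are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Processing (id INTEGER PRIMARY KEY, payload TEXT, status TEXT);
CREATE INDEX idx_processing_status ON Processing (status);
CREATE TABLE Done (id INTEGER PRIMARY KEY, payload TEXT);  -- no status index
""")


def sweep_done(conn):
    """Move finished rows from Processing to Done in one transaction.

    Returns the number of rows removed from Processing.
    """
    with conn:  # both statements commit together or not at all
        conn.execute(
            "INSERT INTO Done (id, payload) "
            "SELECT id, payload FROM Processing WHERE status = 'done'"
        )
        cur = conn.execute("DELETE FROM Processing WHERE status = 'done'")
    return cur.rowcount


conn.executemany(
    "INSERT INTO Processing (id, payload, status) VALUES (?, ?, ?)",
    [(1, "a", "done"), (2, "b", "processing"), (3, "c", "done")],
)
moved = sweep_done(conn)
```

On a server with concurrent writers, a row could flip to done between the two statements, so there it may be safer to collect the ids first and use that same id list for both the copy and the delete.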
Better solution: don't use strings to indicate statuses. Instead use constants in your code with descriptive names => integer values. Then that integer is stored in the database, and MySQL will work a LOT faster than with strings.
I don't know what language you use, but for example in PHP:
class Member
{
    const STATUS_ACTIVE = 1;
    const STATUS_BANNED = 2;
}
if ($member->getStatus() == Member::STATUS_ACTIVE)
{
}
instead of what you have now:
if ($member->getStatus() == 'active')
{
}

What is the best method/options for expiring records within a database?

In a lot of databases I seem to be working on these days I can't just delete a record for any number of reasons, including so that it can still be displayed later on (say a product that no longer exists) or just to keep a history of what was.
So my question is how best to expire the record.
I have often added a date_expired column, which is a datetime field. Generally I query either where date_expired = 0, or where date_expired = 0 OR date_expired > NOW(), depending on whether the data is going to be expired in the future. Similar to this, I have also added a field called expired_flag. When this is set to true/1, the record is considered expired. This is probably the easiest method, although you need to remember to include the expire clause any time you only want the current items.
Another method I have seen is moving the record to an archive table, but this can get quite messy when there are a large number of tables that require history tables. It also makes the retrieval of the value (say country) more difficult as you have to first do a left join (for example) and then do a second query to find the actual value (or redo the query with a modified left join).
Another option, which I haven't seen done nor have I fully attempted myself is to have a table that contains either all of the data from all of the expired records or some form of it--some kind of history table. In this case, retrieval would be even more difficult as you would need to search possibly a massive table and then parse the data.
Are there other solutions or modifications of these that are better?
I am using MySQL (with PHP), so I don't know if other databases have better methods to deal with this issue.
I prefer the date expired field method. However, sometimes it is useful to have two dates, both initial date, and date expired. Because if data can expire, it is often useful to know when it was active, and that means also knowing when it started existing.
I like the expired_flag option over the date_expired option, if query speed is important to you.
I think adding the date_expired column is the easiest and least invasive method. As long as your INSERTS and SELECTS use explicit column lists (they should be if they're not) then there is no impact to your existing CRUD operations. Add an index on the date_expired column and developers can add it as a property to any classes or logic that depend on the data in the existing table. All in all the best value for the effort. I agree that the other methods (i.e. archive tables) are troublesome at best, by comparison.
I usually don't like database triggers, since they can lead to strange "behind the scenes" behavior, but putting a trigger on delete to insert the about-to-be-deleted data into a history table might be an option.
In my experience, we usually just use an "Active" bit, or a "DateExpired" datetime like you mentioned. That works pretty well, and is really easy to deal with and query.
There's a related post here that offers a few other options. Maybe the CDC option?
SQL Server history table - populate through SP or Trigger?
May I also suggest adding a "Status" column that matches an enumerated type in the code you're using. Put an index on the column and you'll be able to very easily and efficiently narrow down your returned data via your where clauses.
Some possible enumerated values to use, depending on your needs:
Active
Deleted
Suspended
InUse (Sort of a pseudo-locking mechanism)
Set the column up as a tinyint (that's SQL Server...not sure of the MySQL equivalent). You can also set up a matching lookup table with the key/value pairs and a foreign key constraint between the tables if you wish.
I've always used the ValidFrom, ValidTo approach where each table has these two additional fields. If ValidTo Is Null or > Now() then you know you have a valid record. In this way you can also add data to the table before it's live.
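A small sketch of the ValidFrom/ValidTo filter, again using Python's sqlite3 module as a stand-in for MySQL. The table products, its sample rows and the function valid_at are hypothetical; dates are stored as ISO strings so that plain string comparison orders them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (name TEXT, valid_from TEXT NOT NULL, valid_to TEXT)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("retired", "2020-01-01", "2021-01-01"),   # already expired
        ("current", "2020-01-01", None),           # open-ended
        ("upcoming", "2030-01-01", None),          # loaded before it is live
        ("ending", "2020-01-01", "2030-01-01"),    # expires in the future
    ],
)


def valid_at(conn, when):
    """Return the names of rows that are valid at the ISO date `when`."""
    rows = conn.execute(
        "SELECT name FROM products "
        "WHERE valid_from <= ? AND (valid_to IS NULL OR valid_to > ?) "
        "ORDER BY name",
        (when, when),
    )
    return [name for (name,) in rows]
```

Passing the date explicitly instead of calling NOW() in the query also makes it possible to ask what was valid at some point in the past.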
There are some fields that my tables usually have: creation_date, last_modification, last_modifier (fk to user), is_active (boolean or number, depending on the database).
Look at the "Slowly Changing Dimension" SCD algorithms. There are several choices from the Data Warehousing world that apply here.
None is "best" -- each responds to different requirements.
Here's a tidy summary.
Type 1: The new record replaces the original record. No trace of the old record exists.
Type 4 is a variation on this that moves the history to another table.
Type 2: A new record is added into the customer dimension table. To distinguish the records, a "valid date range" pair of columns is required. It helps to have a "this record is current" flag.
Type 3: The original record is modified to reflect the change.
In this case, there are columns for one or more previous values of the columns likely to change. This has an obvious limitation because it's bound to a specific number of columns. However, it is often used in conjunction with other types.
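To make Type 2 concrete, here is a sketch of a single change applied that way. It is an illustration under assumptions: SQLite through Python's sqlite3 module instead of MySQL, and the table customer_dim, its columns and the function change_city are all invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_dim ("
    " customer_id INTEGER, city TEXT,"
    " valid_from TEXT, valid_to TEXT, is_current INTEGER)"
)


def change_city(conn, customer_id, new_city, when):
    """Type 2 change: close the current row, then add a new current row."""
    with conn:  # both steps happen in one transaction
        conn.execute(
            "UPDATE customer_dim SET valid_to = ?, is_current = 0 "
            "WHERE customer_id = ? AND is_current = 1",
            (when, customer_id),
        )
        conn.execute(
            "INSERT INTO customer_dim VALUES (?, ?, ?, NULL, 1)",
            (customer_id, new_city, when),
        )


change_city(conn, 1, "Oslo", "2020-01-01")
change_city(conn, 1, "Bergen", "2022-06-01")
```

After the second call the Oslo row is still there with its date range closed, and the Bergen row carries the "current" flag.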
You can read more about this if you search for "Slowly Changing Dimension".
http://en.wikipedia.org/wiki/Slowly_Changing_Dimension
A very nice approach by Oracle to this problem is partitions. MySQL has supported table partitioning since version 5.1 as well.