I would like to know if there's some regular way to handle duplicates in the database without actually removing the duplicated rows. Or a specific name for what I'm trying to achieve, so I can check it out.
Why would I keep duplicates? Because I have to monitor them. I have to know that they're duplicates and are not e.g. searchable, but at the same time, I have to keep them, because I update the rows from external source and if I'd remove them, they'd go back to the database as soon as I update from external source.
I have two ideas:
Have an additional boolean column "searchable", but I feel it's a partial solution, it can turn out to be insufficient in the future
Have an additional column "duplicate_of". It would keep id of the column of which the row is duplicate. It would be a foreign key of the same table which is kind of weird., isn't it?
I know it's not a specific programming question, but I think that someone must have handled a similar situation (Facebook - Pages they keep track of those which are duplicates of others) and it would be great to know a verified solution.
EDIT: these are close duplicates, indetified mainly by their location (lat, lng), so DISTINCT is probably not a solution here
I would create a view that has DISTINCT values. Having an additional column to be searchable sounds tedious. Your second idea is actually more feasible and there is nothing weird about a self-referencing table.
The solution depends on several other factors. In particular, does the database support real deletes and updates (apart from setting the duplication information)?
You have a range of solutions. One is to place distinct values in a separate table, periodically. This works well if you have batch inserts, and no updates/deletes.
If you have a database that is being updated, then you might want to maintain a version number on the record. This lets you track it. Presumably, if it is a duplicate, there is another duplicate key inside it.
The problem with your second approach is that it can result in a tree-like structure of duplicates. Where A-->B-->C and D--> so A and D are duplicates, but this is not obvious. If you always put in the earliest value and there are no updates or deletes, then this solution is reasonable.
Related
Everyone says don't re-use deleted MySql keys. eg. Stack Overflow question: I want to reuse the gaps of the deleted rows
I have read all of the "expert" opinions but have not found a single answer that gives a valid reason why not. Everyone simply asks "why do you want to"?
Well here is a very good reason. If my users have a choice of entering URL mysite.com/person.php?id=123 or a URL mysite.com/person.php?id=123456789123, which one would they most likely prefer?
So can anyone give me a reason why re-using 123 would be a bad idea? I am actually not talking about one record. My records get added and deleted in blocks of several thousand. Updates are very rare and I am the only person who does updates.
There are also no dependencies. Nothing points to those records so there are no integrity issues with other tables.
When I want to add another block of records I will have a simple search routine that searches for the first block of unused record keys large enough to accommodate all of the records being added. Much the same way that hard disk space usage works.
Keys are usually used as unique identifiers, if they are used again, they stop being unique, and become shared. This is the logic behind the idea of not to reuse keys.
So I would suggest, split the key and the id of the user, to two fields, key the key as unique, and the id make it "choose-able" via a gap-finding function.
Before you split, create this new column called user-id, and copy to it the id (which is currently your key) of the users.
Then make this column unique, so that you prevent accidental cases of id reuse.
And you are "home" free.
So I have a checkbox form where users can select multiple values. Then can then go back and select different values. Each value is stored as a row (UserID,value).
How do you do that INSERT when some rows might be duplicates of an already-existing row in the table?
Should I first delete the existing values and then INSERT the new values?
ON DUPLICATE KEY UPDATE seems tricky since I would be INSERTing multiple rows at once, so how would I define and separate just the ones that need UPDATING vs. the ones that need INSERTING?
For example, let's say a user makes his first-time selection:
INSERT INTO
Choices(UserID,value)
VALUES
('1','banana'),('1','apple'),('1','orange'),('1','cranberry'),('1','lemon')
What if the user goes back later and makes different choices which include SOME of the values in his original query which will thus cause duplicates?
How should I handle that best?
In my opinion, simply deleting the existing choices and then inserting the new ones is the best way to go. It may not be the most efficient overall, but it is simple to code and thus has a much better chance of being correct.
Otherwise it is necessary to find the intersection of the new choices and old choices. Then either delete the obsolete ones or change them to the new choices (and then insert/delete depending on if the new set of choices is bigger or smaller than the original set). The added risk of the extra complexity does not seem worth it.
Edit As #Andrew points out in the comments, deleting the originals en masse may not be a good plan if these records happened to be "parent" records in a referential integrity definition. My thinking was that this seemed like an unlikely situation based on the OP's description. But it is definitely worth consideration.
It's not clear to me when you would ever need to update a record in the database in your case.
It sounds like you need to maintain a set of choices per user, which the user may on occasion change. Therefore, each time the user provides a new set of choices, any prior set of choices should be discarded. So you would delete all old records, then insert any new ones.
You might consider carrying out a comparison of the prior and new choices - either in the server or client code - in order to calculate the minimum set of deletes and/or inserts needed to reduce database writes. But that smells like premature optimisation.
Putting all that to one side - if you want a re-insert to be ignored then you should use INSERT IGNORE, then existing rows will be quietly ignored and new ones will be inserted.
I don't know much about mysql but in MS SQL 2000+ we can execute a stored proc with XML as one of it's parameters. This XML would contain a list of identity-value pairs. We would open this XML as a table using openxml and figure out which rows need to be deleted or inserted using left or right outer join. As of SQL 2008 (I think) we have a new merge statement that let's us perform delete, update and insert row operations in one statement on ONE table. This way we can take advantage of Set mathematical operations from SQL instead of looping through arrays in the application code.
You can also keep your select list retrieved from the database in session and compare the "old list" to the "newly selected list" in your application code. You would need to figure out which rows need to be deleted or added. You probably don't need to worry about updates because you are probably only keeping foreign keys in this table and the descriptions are in some kind of a reference table.
There is another way in SQL 2008 that involves using user defined data-types as custom tables but I don't know much about it.
Personally, I prefer the XML route because you just send the end-state into the sp and your sp automatically figures out which rows need to deleted or inserted.
Hope this helps.
I've inherited the task of maintaining a very poorly-coded e-commerce site and I'm working on refactoring a lot of the code and trying to fix ongoing bugs.
Every database insert (adding an item to cart, etc.) begins with a grab_new_id function which COUNTs the number of rows in the table, then, starting with that number, querys the database to find an unused index number. In addition to being terrible performance-wise (there are 40,000+ rows already, and indexes are regularly deleted, so sometimes it takes several seconds just to find a new id) this breaks regularly when two operations are preformed simultaneously, as two entries are added with duplicate id numbers.
This seems idiotic to me - why not just use auto-increment on the index field? I've tested it both ways, and adding rows to the table without specifying an index id is (obviously) many times faster. My question is: can anyone think of any reason the original programmer might have done this? Is there some school of thought where auto_increment is somehow considered bad form? Are there databases that don't have auto-increment capabilities?
I've seen this before from someone that didn't know that feature existed. Definitely use the auto-increment feature.
Some people take the "roll your own" approach to everything, often because they haven't taken the time to see if that is an available feature or if someone else had already come up with it. You'll often see crazy workarounds or poor performing/fragile code from these people. Inheriting a bad database is no fun at all, good luck!
Well Oracle has sequences but not auto-generated ids as I understand it. However, usually this kind of stuff is done by devs who don't understand database programming and who hate to see gaps in the data (as you get from rollbacks). There are also people who like to create the id, so they have it available beforhand to use for child tables, but most databases with autogenerated ids also have a way to return that id to the user at the time of creation.
The only issue I found partially reasonable (but totally avoidable!) against auto_inc fields is that some backup tools by default include auto_inc values into table definition even if you don't include data into a db dump that may be inconvenient.
Depending on the specific situation, there are clearly many reasons for not using consecutive numbers as a primary key.
However, under the given that I do want consecutive numbers as a primary key, I see no reason not to use the built in auto_increment functionality MySQL offers
It was probably done that way for historical reasons; i.e. earlier versions didn't have autoinc variables. I've written code that uses manual autoinc fields on databases that don't support autoinc types, but my code wasn't quite as inefficient as pulling a count().
One issue with using autoinc fields as a primary key is that moving records in and out of tables may result in the primary key changing. So, I'd recommend designing in a "LegacyID" field up front that can be used as future storage for the primary key for times when you are moving records in and out of the table.
They may just have been inexperienced and unfamiliar with auto increment. One reason I can think of, but doesn't necessarily make much sense, is that it is difficult (not impossible) to copy data from one environment to another when using auto increment id's.
For this reason, I have used sequential Guids as my primary key before for ease of transitioning data, but counting the rows to populate the ID is a bit of a WTF.
Two things to watch for:
1.Your RDBMS intelligently sets the auto-increment value upon restart. Our engineers were rolling their own auto-increment key to get around the auto-increment field jumping by an order of 100000s whenever the server restarted. However, at some point Sybase added an option to set the size of the auto-increment.
2.The other place where auto-increment can get nasty is if you are replicating databases and are using a master-master configuration. If you write on both databases (NOT ADVISED), you can run into identity-collision.
I doubt either of these were the case, but things to be aware of.
I could see if the ids were generated on the client and pushed into the database, this is common practice when speed is necessary, but what you discribed seems over the top and unnecessary. Remove it and start an auto incrementing id.
In a lot of databases I seem to be working on these days I can't just delete a record for any number of reasons, including so later on they can be displayed later (say a product that no longer exists) or just keeping a history of what was.
So my question is how best to expire the record.
I have often added a date_expired column which is datetime field. Generally I query either where date_expired = 0 or date_expired = 0 OR date_expired > NOW() depending if the data is going to be expired in the future. Similar to this, I have also added a field call expired_flag. When this is set to true/1, the record is considered expired. This is the probably the easiest method, although you need to remember to include the expire clause any time you only want the current items.
Another method I have seen is moving the record to an archive table, but this can get quite messy when there are a large number of tables that require history tables. It also makes the retrieval of the value (say country) more difficult as you have to first do a left join (for example) and then do a second query to find the actual value (or redo the query with a modified left join).
Another option, which I haven't seen done nor have I fully attempted myself is to have a table that contains either all of the data from all of the expired records or some form of it--some kind of history table. In this case, retrieval would be even more difficult as you would need to search possibly a massive table and then parse the data.
Are there other solutions or modifications of these that are better?
I am using MySQL (with PHP), so I don't know if other databases have better methods to deal with this issue.
I prefer the date expired field method. However, sometimes it is useful to have two dates, both initial date, and date expired. Because if data can expire, it is often useful to know when it was active, and that means also knowing when it started existing.
I like the expired_flag option over the date_expired option, if query speed is important to you.
I think adding the date_expired column is the easiest and least invasive method. As long as your INSERTS and SELECTS use explicit column lists (they should be if they're not) then there is no impact to your existing CRUD operations. Add an index on the date_expired column and developers can add it as a property to any classes or logic that depend on the data in the existing table. All in all the best value for the effort. I agree that the other methods (i.e. archive tables) are troublesome at best, by comparison.
I usually don't like database triggers, since they can lead to strange "behind the scenes" behavior, but putting a trigger on delete to insert the about-to-be-deleted data into a history table might be an option.
In my experience, we usually just use an "Active" bit, or a "DateExpired" datetime like you mentioned. That works pretty well, and is really easy to deal with and query.
There's a related post here that offers a few other options. Maybe the CDC option?
SQL Server history table - populate through SP or Trigger?
May I also suggest adding a "Status" column that matches an enumerated type in the code you're using. Drop an index on the column and you'll be able to very easily and efficiently narrow down your returned data via your where clauses.
Some possible enumerated values to use, depending on your needs:
Active
Deleted
Suspended
InUse (Sort of a pseudo-locking mechanism)
Set the column up as an tinyint (that's SQL Server...not sure of the MySQL equivalent). You can also setup a matching lookup table with the key/value pairs and a foreign key constraint between the tables if you wish.
I've always used the ValidFrom, ValidTo approach where each table has these two additional fields. If ValidTo Is Null or > Now() then you know you have a valid record. In this way you can also add data to the table before it's live.
There are some fields that my tables usually have: creation_date, last_modification, last_modifier (fk to user), is_active (boolean or number, depending on the database).
Look at the "Slowly Changing Dimension" SCD algorithms. There are several choices from the Data Warehousing world that apply here.
None is "best" -- each responds to different requirements.
Here's a tidy summary.
Type 1: The new record replaces the original record. No trace of the old record exists.
Type 4 is a variation on this moves the history to another table.
Type 2: A new record is added into the customer dimension table. To distinguish, a "valid date range" pair of columns in required. It helps to have a "this record is current" flag.
Type 3: The original record is modified to reflect the change.
In this case, there are columns for one or more previous values of the columns likely to change. This has an obvious limitation because it's bound to a specific number of columns. However, it is often used on conjunction with other types.
You can read more about this if you search for "Slowly Changing Dimension".
http://en.wikipedia.org/wiki/Slowly_Changing_Dimension
A very nice approach by Oracle to this problem is partitions. I don't think MySQL have something similar though.
Table1:
Everything including the kitchen sink. Dates in the wrong format (year last so you cannot sort on that column), Numbers stored as VARCHAR, complete addresses in the 'street' column, firstname and lastname in the firstname column, city in the lastname column, incomplete addresses, Rows that update preceeding rows by moving data from one field to another based on some set of rules that has changed over the years, duplicate records, incomplete records, garbage records... you name it... oh and of course not a TIMESTAMP or PRIMARY KEY column in sight.
Table2:
Any hope of normalization went out the window upon cracking this baby open.
We have a row for each entry AND update of rows in table one. So duplicates like there is no tomorrow (800MB worth) and columns like Phone1 Phone2 Phone3 Phone4 ... Phone15 (they are not called phone. I use this for illustration) The foriegn key is.. well take guess. There are three candidates depending on what kind of data was in the row in table1
Table3:
Can it get any worse. Oh yes.
The "foreign key is a VARCHAR column combination of dashes, dots, numbers and letters! if that doesn't provide the match (which it often doesn't) then a second column of similar product code should. Columns that have names that bear NO correlation to the data within them, and the obligatory Phone1 Phone2 Phone3 Phone4... Phone15. There are columns Duplicated from Table1 and not a TIMESTAMP or PRIMARY KEY column in sight.
Table4: was described as a work in progess and subject to change at any moment. It is essentailly simlar to the others.
At close to 1m rows this is a BIG mess. Luckily it is not my big mess. Unluckily I have to pull out of it a composit record for each "customer".
Initially I devised a four step translation of Table1 adding a PRIMARY KEY and converting all the dates into sortable format. Then a couple more steps of queries that returned filtered data until I had Table1 to where I could use it to pull from the other tables to form the composit. After weeks of work I got this down to one step using some tricks. So now I can point my app at the mess and pull out a nice clean table of composited data. Luckily I only need one of the phone numbers for my purposes so normalizing my table is not an issue.
However this is where the real task begins, because every day hundreds of employees add/update/delete this database in ways you don't want to imagine and every night I must retrieve the new rows.
Since existing rows in any of the tables can be changed, and since there are no TIMESTAMP ON UPDATE columns, I will have to resort to the logs to know what has happened. Of course this assumes that there is a binary log, which there is not!
Introducing the concept went down like lead balloon. I might as well have told them that their children are going to have to undergo experimental surgery. They are not exactly hi tech... in case you hadn't gathered...
The situation is a little delicate as they have some valuable information that my company wants badly. I have been sent down by senior management of a large corporation (you know how they are) to "make it happen".
I can't think of any other way to handle the nightly updates, than parsing the bin log file with yet another application, to figure out what they have done to that database during the day and then composite my table accordingly. I really only need to look at their table1 to figure out what to do to my table. The other tables just provide fields to flush out the record. (Using MASTER SLAVE won't help because I will have a duplicate of the mess.)
The alternative is to create a unique hash for every row of their table1 and build a hash table. Then I would go through the ENTIRE database every night checking to see if the hashs match. If they do not then I would read that record and check if it exists in my database, if it does then I would update it in my database, if it doesn't then its a new record and I would INSERT it. This is ugly and not fast, but parsing a binary log file is not pretty either.
I have written this to help get clear about the problem. often telling it to someone else helps clarify the problem making a solution more obvious. In this case I just have a bigger headache!
Your thoughts would be greatly appreciated.
I am not a MySQL person, so this is coming out of left field.
But I think the log files might be the answer.
Thankfully, you really only need to know 2 things from the log.
You need the record/rowid, and you need the operation.
In most DB's, and I assume MySQL, there's an implicit column on each row, like a rowid or recordid, or whatever. It's the internal row number used by the database. This is your "free" primary key.
Next, you need the operation. Notably whether it's an insert, update, or delete operation on the row.
You consolidate all of this information, in time order, and then run through it.
For each insert/update, you select the row from your original DB, and insert/update that row in your destination DB. If it's a delete, then you delete the row.
You don't care about field values, they're just not important. Do the whole row.
You hopefully shouldn't have to "parse" binary log files, MySQL already must have routines to do that, you just need to find and figure out how to use them (there may even be some handy "dump log" utility you could use).
This lets you keep the system pretty simple, and it should only depend on your actual activity during the day, rather than the total DB size. Finally, you could later optimize it by making it "smarter". For example, perhaps they insert a row, then update it, then delete it. You would know you can just ignore that row completely in your replay.
Obviously this takes a bit of arcane knowledge in order actually read the log files, but the rest should be straightforward. I would like to think that the log files are timestamped as well, so you can know to work on rows "from today", or whatever date range you want.
The Log Files (binary Logs) were my first thought too. If you knew how they did things you would shudder. For every row there are many many entries in the log as pieces are added and changed. Its just HUGE!
For now I settled upon the Hash approach. With some clever file memory paging this is quite fast.
Can't you use the existing code which accesses this database and adapt it to your needs? Of course, the code must be horrible, but it might handle the database structure for you, no? You could hopefully concentrate on getting your work done instead of playing archaeologist then.
you might be able to use maatkit's mk-table-sync tool to synchronise a staging database (your database is only very small, after all). This will "duplicate the mess"
You could then write something that, after the sync, does various queries to generate a set of more sane tables that you can then report off.
I imagine that this could be done on a daily basis without a performance problem.
Doing it all off a different server will avoid impacting the original database.
The only problem I can see is if some of the tables don't have primary keys.