Mapping records between databases using identification numbering system - mysql

I have two databases, one MySQL and one SQLite, which synchronize back and forth to maintain the same data. To prevent duplicates on either side, I was thinking of having an identification numbering system for records, but I'm not sure how to go about that.
I need to somehow create a unique ID for records on both databases, for example:
MySQL  ===> data = 1, 5  id = ???
SQLite ===> data = 1, 5  id = ???
I need the ID to be the same on both sides, so that when I synchronize, the record is not transferred over to the other database again.
Another way I thought of is creating a hash from two columns in the database; if the same data is already on the other server, that record does not get synchronized.
Using a column of the database table as a unique identifier is not suitable in my case.
I'm really not sure how to go about this, so any help will be great, thanks!

I understand the question to mean that you need to somehow identify whether two rows in two different SQL databases are the same, either because they were created independently or because of an earlier sync.
I think your idea of a hash value is fine; it should do the trick. However, you could also just concatenate the column values into a string and get the same result, perhaps with a dash in between in case you have several data columns that would otherwise become ambiguous ("12-2" and "1-12" are then different).
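As a minimal sketch on the MySQL side (the table and column names here are placeholders; note that SQLite has no built-in MD5() function, so on that side you would compute the hash in application code or just compare the plain concatenated string):
SELECT MD5(CONCAT_WS('-', data_col1, data_col2)) AS row_key  -- '-' separator keeps "12","2" distinct from "1","12"
FROM your_table;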
But you still need to send over the generated hash values or concatenated strings of all rows in order to sync. Maybe it makes sense to track rows that are already synced? But then you may need to untrack them if the row data values are later updated.
I am not sure whether this answer is helpful to you, because the question leaves many points open to speculation. May I suggest making it a bit clearer what you are trying to achieve?

Related

Handle duplicates without removing them from database

I would like to know if there's some regular way to handle duplicates in the database without actually removing the duplicated rows. Or a specific name for what I'm trying to achieve, so I can check it out.
Why would I keep duplicates? Because I have to monitor them. I have to know that they're duplicates and are not e.g. searchable, but at the same time I have to keep them, because I update the rows from an external source, and if I removed them, they'd come back into the database as soon as I update from that source again.
I have two ideas:
Have an additional boolean column "searchable", but I feel it's a partial solution that could turn out to be insufficient in the future
Have an additional column "duplicate_of". It would hold the id of the row of which this row is a duplicate. It would be a foreign key into the same table, which is kind of weird, isn't it?
I know it's not a specific programming question, but I think someone must have handled a similar situation (Facebook Pages, for example, keep track of which pages are duplicates of others) and it would be great to know a verified solution.
EDIT: these are close duplicates, identified mainly by their location (lat, lng), so DISTINCT is probably not a solution here
I would create a view that has DISTINCT values. Having an additional column to be searchable sounds tedious. Your second idea is actually more feasible and there is nothing weird about a self-referencing table.
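A rough sketch of the duplicate_of idea combined with a view (table and column names are placeholders, assuming an integer primary key called id):
ALTER TABLE places
  ADD COLUMN duplicate_of INT NULL,
  ADD CONSTRAINT fk_places_duplicate_of FOREIGN KEY (duplicate_of) REFERENCES places (id);
-- searchable rows are simply those not marked as a duplicate of anything
CREATE VIEW places_searchable AS
SELECT * FROM places
WHERE duplicate_of IS NULL;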
The solution depends on several other factors. In particular, does the database support real deletes and updates (apart from setting the duplication information)?
You have a range of solutions. One is to place distinct values in a separate table, periodically. This works well if you have batch inserts, and no updates/deletes.
If you have a database that is being updated, then you might want to maintain a version number on the record. This lets you track it. Presumably, if it is a duplicate, there is another duplicate key inside it.
The problem with your second approach is that it can result in a tree-like structure of duplicates, where A --> B --> C and D --> C, so A and D are duplicates of the same row, but this is not obvious. If you always point duplicates at the earliest row and there are no updates or deletes, then this solution is reasonable.

How to INSERT multiple rows when some might be DUPLICATES of an already-existing row?

So I have a checkbox form where users can select multiple values. They can then go back and select different values. Each value is stored as a row (UserID, value).
How do you do that INSERT when some rows might be duplicates of an already-existing row in the table?
Should I first delete the existing values and then INSERT the new values?
ON DUPLICATE KEY UPDATE seems tricky since I would be INSERTing multiple rows at once, so how would I define and separate just the ones that need UPDATING vs. the ones that need INSERTING?
For example, let's say a user makes his first-time selection:
INSERT INTO
Choices(UserID,value)
VALUES
('1','banana'),('1','apple'),('1','orange'),('1','cranberry'),('1','lemon')
What if the user goes back later and makes different choices which include SOME of the values in his original query which will thus cause duplicates?
How should I handle that best?
In my opinion, simply deleting the existing choices and then inserting the new ones is the best way to go. It may not be the most efficient overall, but it is simple to code and thus has a much better chance of being correct.
Otherwise it is necessary to find the intersection of the new choices and the old choices, then either delete the obsolete ones or change them to the new choices (and then insert/delete depending on whether the new set of choices is bigger or smaller than the original set). The added risk of the extra complexity does not seem worth it.
Edit: As @Andrew points out in the comments, deleting the originals en masse may not be a good plan if these records happen to be "parent" records in a referential integrity definition. My thinking was that this seemed like an unlikely situation based on the OP's description. But it is definitely worth considering.
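Using the table from the question, the delete-then-insert approach might look roughly like this, wrapped in a transaction so a failed insert does not leave the user with no choices at all (the values shown are just an example of a changed selection):
START TRANSACTION;
DELETE FROM Choices WHERE UserID = '1';
INSERT INTO Choices (UserID, value)
VALUES ('1','banana'), ('1','apple'), ('1','kiwi');
COMMIT;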
It's not clear to me when you would ever need to update a record in the database in your case.
It sounds like you need to maintain a set of choices per user, which the user may on occasion change. Therefore, each time the user provides a new set of choices, any prior set of choices should be discarded. So you would delete all old records, then insert any new ones.
You might consider carrying out a comparison of the prior and new choices - either in the server or client code - in order to calculate the minimum set of deletes and/or inserts needed to reduce database writes. But that smells like premature optimisation.
Putting all that to one side: if you want a re-insert to be ignored, then you should use INSERT IGNORE; existing rows will be quietly skipped and new ones will be inserted.
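A sketch of the INSERT IGNORE route; it assumes a unique key on (UserID, value), without which MySQL has no way of knowing what counts as a duplicate:
-- one-time setup: define what a duplicate is
ALTER TABLE Choices ADD UNIQUE KEY uq_user_value (UserID, value);
-- re-submitting overlapping choices is then harmless
INSERT IGNORE INTO Choices (UserID, value)
VALUES ('1','banana'), ('1','apple'), ('1','orange');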
I don't know much about MySQL, but in MS SQL 2000+ we can execute a stored proc with XML as one of its parameters. This XML would contain a list of identity-value pairs. We would open this XML as a table using OPENXML and figure out which rows need to be deleted or inserted using a left or right outer join. As of SQL 2008 (I think) there is a new MERGE statement that lets us perform delete, update and insert row operations in one statement on ONE table. This way we can take advantage of set operations in SQL instead of looping through arrays in the application code.
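For illustration, a rough T-SQL MERGE sketch of that idea; the incoming choices are shown as an inline VALUES list rather than OPENXML to keep it short, and the table and column names follow the question:
MERGE Choices AS target
USING (VALUES ('1','banana'), ('1','apple')) AS source (UserID, [value])
    ON target.UserID = source.UserID AND target.[value] = source.[value]
WHEN NOT MATCHED BY TARGET THEN
    INSERT (UserID, [value]) VALUES (source.UserID, source.[value])
WHEN NOT MATCHED BY SOURCE AND target.UserID = '1' THEN
    DELETE;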
You can also keep your select list retrieved from the database in session and compare the "old list" to the "newly selected list" in your application code. You would need to figure out which rows need to be deleted or added. You probably don't need to worry about updates because you are probably only keeping foreign keys in this table and the descriptions are in some kind of a reference table.
There is another way in SQL 2008 that involves using user-defined table types to pass table-shaped parameters, but I don't know much about it.
Personally, I prefer the XML route because you just send the end state into the sp and your sp automatically figures out which rows need to be deleted or inserted.
Hope this helps.

Linux diff and patch command line utilities for MySQL data (not structure)

I have two MySQL databases and I would like to write a script to compare and update data changes between them.
Does anyone know a Linux command line tool for diffing or patching data in MySQL databases?
The Way of the Brute Force: Dump both databases and diff the dumps...? ;-)
-- "If your problem is not solved by brute force, you are not applying enough force."
(I'm not (entirely) serious about this...)
diff
As DevSolar suggests, the simple way to get the differences is to do a careful dump of the two databases (with one output file per table, and within each file, one logical line per record in the table), and apply the admirable, reliable and venerable diff program to the files for each table. However, that 'careful' may be something of a spanner in the works - you need to ensure that each data file is dumped in a sorted order (not just a physical order), so that if a record appears in both, it appears at the same position in the file. If the data is not so ordered, you will get lots of spurious differences.
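In MySQL, one way to produce such a sorted per-table dump is shown below (a sketch; the table, columns and key are placeholders, and the server account needs the FILE privilege to write the output):
SELECT id, name, email
FROM TableA
ORDER BY id                          -- sorted, so matching records line up across dumps
INTO OUTFILE '/tmp/db1_TableA.txt'
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';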
Even before you do that, you need to compare the two schemas, because many differences in the schema will automatically make every row of two same-named tables differ. For example, if TableA from database DB1 has 10 columns but TableA from DB2 has 11 columns, every row in the dumped data will be different.
You also need to worry about some other columns that can differ - notably automatically assigned ID numbers, and also 'last update time' or 'creation time' values. The automatic ID numbers in a primary key will often strongly influence the order of data in tables that join to the PK - you have to consider whether there is a good way around that. It will depend in part on the history of the databases; were they once a common database that got copied, modified, and are now being recombined? If so, there may be less of a problem than if they are two databases with the same schema but which have never had any common ancestry in the data stored in them.
You may find that your best bet is to create views such that the data structure reflected by the view is the same for both databases (even if the view definition is not the same because of differences in the schema). You can then compare the results of dumping those views. Done carefully, this can alleviate or minimize the differences due to automatically assigned ID numbers.
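For example, a comparison view that exposes only the stable business columns might look like this on each side (names are placeholders; the auto-increment id and any last-updated timestamp are deliberately left out), and dumping both views with the same ORDER BY then gives diff-friendly files:
CREATE VIEW customers_cmp AS
SELECT last_name, first_name, email, city   -- business data only, no generated ids or timestamps
FROM customers;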
patch
Let's pretend you managed to get comparable data and you now need to synchronize a modest number of differences. Is there a patching tool to do the job?
The answer is quite likely the one you did not want to hear: no.
One issue is that you have to decide what is the required result of the operation. Is it the union of the two databases, or the intersection, or what? Which database are you going to modify - the first or the second, or both?
Rows deleted from one database that appear in the other can either be removed from the other or inserted into the one. Rows inserted are the mirror of rows deleted and need analogous treatment. That was easy...
Where the 'same row' appears in both databases by some criterion, but there are differences in the fields (columns), then you have a trickier job to do. You have to decide which of the different columns should be changed in the database you're currently modifying. The standard Unix tools (such as diff) are designed for line-based differences. At this point, I'd probably drop into Perl (but Python or other scripting languages would do fine), taking the difference records for a table along with a table name and the column list (so that the fields in the data can be associated with columns in the database), and then arrange for it to generate the appropriate statements. Types may be a factor - your UPDATE statement may need to quote strings and not quote numbers for updates. You also need to know the primary key so that you can identify the row to be updated. The output would be a suitable set of UPDATE statements that would morph the first version of the table into the second.
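For one differing row, the generated statement might look something like this (the table, column and key names here are purely illustrative):
UPDATE customers
SET email = 'new.address@example.com',   -- string columns quoted
    balance = 42                         -- numeric columns not quoted
WHERE customer_id = 1001;                -- primary key pins down the row being updated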

How do I search part of a column?

I have a MySQL table containing 40 million records that is being populated by a process over which I have no control. Data is added only once every month. This table needs to be searchable by the Name column, but the Name column contains the full name in the format 'Last First Middle'.
In the sphinx.conf, I have
sql_query = SELECT Id, OwnersName,
substring_index(substring_index(OwnersName,' ',2),' ',-1) as firstname,
substring_index(OwnersName,' ',2) as lastname
FROM table1
How do I use Sphinx search to search by firstname and/or lastname? I would like to be able to search for 'Smith' in only the first name.
Per-row functions in SQL queries are always a bad idea for tables that may grow large. If you want to search on part of a column, it should be extracted out to its own column and indexed.
I would suggest, if you have power over the schema (as opposed to the population process), adding new columns called OwnersFirstName and OwnersLastName along with update/insert triggers which extract the relevant information from OwnersName and populate the new columns appropriately.
This means the expense of figuring out the first name is only done when a row is changed, not every single time you run your query. That is the right time to do it.
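A rough sketch of that idea, assuming the column, index and trigger names below (an analogous BEFORE UPDATE trigger would be needed as well, and the split assumes the 'Last First Middle' format from the question):
ALTER TABLE table1
  ADD COLUMN OwnersLastName VARCHAR(100),
  ADD COLUMN OwnersFirstName VARCHAR(100),
  ADD INDEX idx_owner_names (OwnersLastName, OwnersFirstName);
CREATE TRIGGER table1_split_name BEFORE INSERT ON table1
FOR EACH ROW
SET NEW.OwnersLastName  = SUBSTRING_INDEX(NEW.OwnersName, ' ', 1),
    NEW.OwnersFirstName = SUBSTRING_INDEX(SUBSTRING_INDEX(NEW.OwnersName, ' ', 2), ' ', -1);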
Then your queries become blindingly fast. And, yes, this breaks 3NF, but most people don't realize that it's okay to do that for performance reasons, provided you understand the consequences. And, since the new columns are controlled by the triggers, the data duplication that would be cause for concern is "clean".
Most problems people have with databases come down to the speed of their queries. Wasting a bit of disk space to gain a large performance improvement is usually okay.
If you have absolutely no power over even the schema, another possibility is to create your own database with the "correct" schema and populate it periodically from the real database. Then query yours. That may involve a fair bit of data transfer every month, however, so the first option is the better one, if allowed.
Judging by the other answers, I may have missed something... but to restrict a search in Sphinx to a specific field, make sure you're using the extended (or extended2) match mode, and then use the following query string: @firstname Smith.
You could use substring to get the parts of the field that you want to search in, but that will slow down the process. The query can not use any kind of index to do the comparison, so it has to touch each record in the table.
The best would be not to store several values in the same field, but to put the name components in three separate fields. When you store more than one value in a field, there are almost always problems accessing the data. I see this over and over in different forums...
This is an intractable problem because full names can contain prefixes, suffixes, middle names or no middle names, composite first and last names with and without hyphens, etc. There is no reasonable way to do this with 100% reliability.

The ultimate MySQL legacy database nightmare

Table1:
Everything including the kitchen sink. Dates in the wrong format (year last, so you cannot sort on that column), numbers stored as VARCHAR, complete addresses in the 'street' column, first name and last name in the firstname column, city in the lastname column, incomplete addresses, rows that update preceding rows by moving data from one field to another based on some set of rules that has changed over the years, duplicate records, incomplete records, garbage records... you name it... oh, and of course not a TIMESTAMP or PRIMARY KEY column in sight.
Table2:
Any hope of normalization went out the window upon cracking this baby open.
We have a row for each entry AND each update of rows in Table1. So duplicates like there is no tomorrow (800MB worth) and columns like Phone1 Phone2 Phone3 Phone4 ... Phone15 (they are not called Phone; I use this for illustration). The foreign key is... well, take a guess. There are three candidates depending on what kind of data was in the row in Table1.
Table3:
Can it get any worse? Oh yes.
The "foreign key is a VARCHAR column combination of dashes, dots, numbers and letters! if that doesn't provide the match (which it often doesn't) then a second column of similar product code should. Columns that have names that bear NO correlation to the data within them, and the obligatory Phone1 Phone2 Phone3 Phone4... Phone15. There are columns Duplicated from Table1 and not a TIMESTAMP or PRIMARY KEY column in sight.
Table4: was described as a work in progress and subject to change at any moment. It is essentially similar to the others.
At close to 1m rows this is a BIG mess. Luckily it is not my big mess. Unluckily I have to pull out of it a composite record for each "customer".
Initially I devised a four-step translation of Table1, adding a PRIMARY KEY and converting all the dates into a sortable format. Then a couple more steps of queries that returned filtered data, until I had Table1 to where I could use it to pull from the other tables to form the composite. After weeks of work I got this down to one step using some tricks. So now I can point my app at the mess and pull out a nice clean table of composited data. Luckily I only need one of the phone numbers for my purposes, so normalizing my table is not an issue.
However this is where the real task begins, because every day hundreds of employees add/update/delete this database in ways you don't want to imagine and every night I must retrieve the new rows.
Since existing rows in any of the tables can be changed, and since there are no TIMESTAMP ON UPDATE columns, I will have to resort to the logs to know what has happened. Of course this assumes that there is a binary log, which there is not!
Introducing the concept went down like a lead balloon. I might as well have told them that their children are going to have to undergo experimental surgery. They are not exactly hi-tech... in case you hadn't gathered...
The situation is a little delicate as they have some valuable information that my company wants badly. I have been sent down by senior management of a large corporation (you know how they are) to "make it happen".
I can't think of any other way to handle the nightly updates than parsing the binary log file with yet another application, to figure out what they have done to that database during the day and then composite my table accordingly. I really only need to look at their Table1 to figure out what to do to my table. The other tables just provide fields to flesh out the record. (Using master/slave replication won't help, because I would just have a duplicate of the mess.)
The alternative is to create a unique hash for every row of their Table1 and build a hash table. Then I would go through the ENTIRE database every night, checking to see whether the hashes match. If they do not, then I would read that record and check whether it exists in my database; if it does, I would update it in my database, and if it doesn't, it's a new record and I would INSERT it. This is ugly and not fast, but parsing a binary log file is not pretty either.
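The per-row hash itself is cheap to produce in MySQL; roughly like this (the column names are placeholders, surrogate_id stands for whatever key the translation step adds, and the COALESCE calls stop NULLs from silently shifting positions in the hash input):
SELECT surrogate_id,
       MD5(CONCAT_WS('|',
           COALESCE(name, ''),
           COALESCE(street, ''),
           COALESCE(phone1, ''))) AS row_hash
FROM table1;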
I have written this out to help get clear about the problem; often explaining it to someone else helps clarify the problem and makes a solution more obvious. In this case I just have a bigger headache!
Your thoughts would be greatly appreciated.
I am not a MySQL person, so this is coming out of left field.
But I think the log files might be the answer.
Thankfully, you really only need to know 2 things from the log.
You need the record/rowid, and you need the operation.
In most DB's, and I assume MySQL, there's an implicit column on each row, like a rowid or recordid, or whatever. It's the internal row number used by the database. This is your "free" primary key.
Next, you need the operation. Notably whether it's an insert, update, or delete operation on the row.
You consolidate all of this information, in time order, and then run through it.
For each insert/update, you select the row from your original DB, and insert/update that row in your destination DB. If it's a delete, then you delete the row.
You don't care about field values, they're just not important. Do the whole row.
You hopefully shouldn't have to "parse" binary log files; MySQL must already have routines to do that, you just need to find them and figure out how to use them (there may even be some handy "dump log" utility you could use).
This lets you keep the system pretty simple, and it should only depend on your actual activity during the day, rather than the total DB size. Finally, you could later optimize it by making it "smarter". For example, perhaps they insert a row, then update it, then delete it. You would know you can just ignore that row completely in your replay.
Obviously this takes a bit of arcane knowledge in order to actually read the log files, but the rest should be straightforward. I would like to think that the log files are timestamped as well, so you can know to work on rows "from today", or whatever date range you want.
The log files (binary logs) were my first thought too. If you knew how they did things, you would shudder. For every row there are many, many entries in the log as pieces are added and changed. It's just HUGE!
For now I settled upon the Hash approach. With some clever file memory paging this is quite fast.
Can't you use the existing code which accesses this database and adapt it to your needs? Of course, the code must be horrible, but it might handle the database structure for you, no? You could hopefully concentrate on getting your work done instead of playing archaeologist then.
You might be able to use Maatkit's mk-table-sync tool to synchronise a staging database (your database is only very small, after all). This will "duplicate the mess".
You could then write something that, after the sync, does various queries to generate a set of more sane tables that you can then report off.
I imagine that this could be done on a daily basis without a performance problem.
Doing it all off a different server will avoid impacting the original database.
The only problem I can see is if some of the tables don't have primary keys.