The ultimate MySQL legacy database nightmare

Table1:
Everything including the kitchen sink. Dates in the wrong format (year last, so you cannot sort on that column), numbers stored as VARCHAR, complete addresses in the 'street' column, first name and last name in the firstname column, city in the lastname column, incomplete addresses, rows that update preceding rows by moving data from one field to another based on some set of rules that has changed over the years, duplicate records, incomplete records, garbage records... you name it... oh, and of course not a TIMESTAMP or PRIMARY KEY column in sight.
Table2:
Any hope of normalization went out the window upon cracking this baby open.
We have a row for each entry AND each update of rows in Table1. So duplicates like there is no tomorrow (800MB worth), and columns like Phone1, Phone2, Phone3, Phone4 ... Phone15 (they are not actually called Phone; I use this for illustration). The foreign key is... well, take a guess. There are three candidates, depending on what kind of data was in the row in Table1.
Table3:
Can it get any worse? Oh yes.
The "foreign key is a VARCHAR column combination of dashes, dots, numbers and letters! if that doesn't provide the match (which it often doesn't) then a second column of similar product code should. Columns that have names that bear NO correlation to the data within them, and the obligatory Phone1 Phone2 Phone3 Phone4... Phone15. There are columns Duplicated from Table1 and not a TIMESTAMP or PRIMARY KEY column in sight.
Table4: was described as a work in progress and subject to change at any moment. It is essentially similar to the others.
At close to 1m rows this is a BIG mess. Luckily it is not my big mess. Unluckily, I have to pull out of it a composite record for each "customer".
Initially I devised a four-step translation of Table1, adding a PRIMARY KEY and converting all the dates into a sortable format. Then a couple more steps of queries that returned filtered data, until I had Table1 to where I could use it to pull from the other tables to form the composite. After weeks of work I got this down to one step using some tricks. So now I can point my app at the mess and pull out a nice clean table of composited data. Luckily I only need one of the phone numbers for my purposes, so normalizing my table is not an issue.
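To give a flavour of that first step, here is a simplified sketch (not my actual queries; the column names and the day-month-year source format are just for illustration):

    -- Simplified sketch: add a surrogate PRIMARY KEY and convert the VARCHAR dates
    -- into a sortable DATE. Column names and the '%d-%m-%Y' format are illustrative.
    CREATE TABLE table1_clean (
        id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        entry_date DATE,
        firstname  VARCHAR(100),
        lastname   VARCHAR(100)
    );

    INSERT INTO table1_clean (entry_date, firstname, lastname)
    SELECT STR_TO_DATE(entry_date, '%d-%m-%Y'), firstname, lastname
    FROM table1;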
However this is where the real task begins, because every day hundreds of employees add/update/delete this database in ways you don't want to imagine and every night I must retrieve the new rows.
Since existing rows in any of the tables can be changed, and since there are no TIMESTAMP ON UPDATE columns, I will have to resort to the logs to know what has happened. Of course this assumes that there is a binary log, which there is not!
Introducing the concept went down like a lead balloon. I might as well have told them that their children were going to have to undergo experimental surgery. They are not exactly high tech... in case you hadn't gathered...
The situation is a little delicate as they have some valuable information that my company wants badly. I have been sent down by senior management of a large corporation (you know how they are) to "make it happen".
I can't think of any other way to handle the nightly updates than parsing the binlog file with yet another application, to figure out what they have done to that database during the day and then composite my table accordingly. I really only need to look at their Table1 to figure out what to do to my table. The other tables just provide fields to flesh out the record. (Using MASTER/SLAVE replication won't help, because I would just have a duplicate of the mess.)
The alternative is to create a unique hash for every row of their Table1 and build a hash table. Then I would go through the ENTIRE database every night, checking to see if the hashes match. If they do not, then I would read that record and check whether it exists in my database; if it does, I would update it in my database, and if it doesn't, it's a new record and I would INSERT it. This is ugly and not fast, but parsing a binary log file is not pretty either.
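Concretely, the per-row fingerprint I have in mind would be something like this (a sketch only; the column names are placeholders):

    -- Sketch: a per-row fingerprint for change detection. COALESCE keeps NULLs from
    -- being skipped by CONCAT_WS, which would otherwise make different rows hash alike.
    SELECT
        MD5(CONCAT_WS('|',
            COALESCE(firstname, ''),
            COALESCE(lastname, ''),
            COALESCE(street, ''),
            COALESCE(phone1, ''))) AS row_hash
    FROM table1;

Comparing tonight's fingerprints against the stored set from the previous night would flag which rows need to be re-read.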
I have written this to help get clear about the problem. Often, telling it to someone else helps clarify the problem, making a solution more obvious. In this case I just have a bigger headache!
Your thoughts would be greatly appreciated.

I am not a MySQL person, so this is coming out of left field.
But I think the log files might be the answer.
Thankfully, you really only need to know 2 things from the log.
You need the record/rowid, and you need the operation.
In most DB's, and I assume MySQL, there's an implicit column on each row, like a rowid or recordid, or whatever. It's the internal row number used by the database. This is your "free" primary key.
Next, you need the operation. Notably whether it's an insert, update, or delete operation on the row.
You consolidate all of this information, in time order, and then run through it.
For each insert/update, you select the row from your original DB, and insert/update that row in your destination DB. If it's a delete, then you delete the row.
You don't care about field values, they're just not important. Do the whole row.
You hopefully shouldn't have to "parse" binary log files; MySQL must already have routines to do that, you just need to find them and figure out how to use them (there may even be some handy "dump log" utility you could use).
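For instance, MySQL ships a mysqlbinlog command-line utility that decodes the binary log to text, and there is also a SHOW BINLOG EVENTS statement you can run from a client. A sketch (the log file name here is made up):

    -- Sketch: list decoded events from one binary log file (name is hypothetical);
    -- position 4 is the first event, and LIMIT keeps the result manageable.
    SHOW BINLOG EVENTS IN 'mysql-bin.000042' FROM 4 LIMIT 100;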
This lets you keep the system pretty simple, and it should only depend on your actual activity during the day, rather than the total DB size. Finally, you could later optimize it by making it "smarter". For example, perhaps they insert a row, then update it, then delete it. You would know you can just ignore that row completely in your replay.
Obviously this takes a bit of arcane knowledge in order to actually read the log files, but the rest should be straightforward. I would like to think that the log files are timestamped as well, so you can know to work on rows "from today", or whatever date range you want.

The log files (binary logs) were my first thought too. If you knew how they did things, you would shudder. For every row there are many, many entries in the log as pieces are added and changed. It's just HUGE!
For now I have settled upon the hash approach. With some clever file memory paging this is quite fast.

Can't you use the existing code which accesses this database and adapt it to your needs? Of course, the code must be horrible, but it might handle the database structure for you, no? You could hopefully concentrate on getting your work done instead of playing archaeologist then.

You might be able to use Maatkit's mk-table-sync tool to synchronise a staging database (your database is only very small, after all). This will "duplicate the mess".
You could then write something that, after the sync, runs various queries to generate a set of saner tables that you can then report off.
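For example, the post-sync cleanup could be as simple as something like this (illustrative names only; your real queries will be messier):

    -- Hypothetical cleanup query run against the synced staging copy
    CREATE TABLE report_customers AS
    SELECT DISTINCT
        TRIM(firstname) AS firstname,
        TRIM(lastname)  AS lastname,
        COALESCE(NULLIF(phone1, ''), phone2) AS phone
    FROM staging.table1;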
I imagine that this could be done on a daily basis without a performance problem.
Doing it all off a different server will avoid impacting the original database.
The only problem I can see is if some of the tables don't have primary keys.

Related

How to generate a reliable synthetic key based on unreliable user-entered data?

Friends! My problem for you today is this: I'm working with a shared Excel file which several employees in my company edit to track work done using one of our tools. Every day, I'll grab the latest version of this file, scrub the bajeezus out of the user-entered data, and then load it into our MySQL database for use with our BI tools.
When I insert that data into the database, I'm using an auto-incrementing integer primary key to identify each record (as one should do). But my dilemma is this: this primary key does absolutely nothing to prevent me from inserting the same record from the Excel file multiple times. I could insert the same row numerous times and MySQL would just happily accept the clone and keep incrementing the integer.
Obviously, I have really good discipline about not inserting the same row twice, but if I fall down an open manhole and die, I'd like this process to be safe enough that someone picking up for me couldn't possibly run into this issue.
So, I want to try to come up with a natural key from the user input that will help me uniquely identify every record in the dataset, so that I can never insert the same row twice. The problem is, not all of the columns in my dataset are always present, even the ones that would conceivably make good natural keys on their own; also, this being user-entered data, there's going to be a high rate of error in the data being entered.
So, what I want to know is: what are the best practices for creating a good, reliable, uniqueness-enforcing key when the data you're inserting isn't giving you much to work with? Cyclic-redundancy checksums? Cobbling together a UUID generator in Power Query?
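To make the question concrete, here is the sort of checksum idea I mean (column names are made up, and it assumes MySQL 5.7+ for generated columns):

    -- Sketch of the checksum idea: a fingerprint of the user-entered fields with a
    -- UNIQUE index, so re-inserting the same row fails instead of creating a clone.
    -- Column names are hypothetical; requires MySQL 5.7+ for generated columns.
    ALTER TABLE work_log
        ADD COLUMN row_fingerprint CHAR(32)
            AS (MD5(CONCAT_WS('|',
                COALESCE(employee, ''),
                COALESCE(tool, ''),
                COALESCE(work_item, ''),
                COALESCE(notes, '')))) STORED,
        ADD UNIQUE KEY uq_row_fingerprint (row_fingerprint);

    -- Loads could then use INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE.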

Mapping records between databases using identification numbering system

I have 2 databases, one MySQL database and a SQLite database, which synchronize back and forth to maintain the same data. To prevent duplicates on either side I was thinking of having an identification numbering system for records, but I'm not sure how I will go about that.
I need to somehow create a unique ID for records on both databases, for example:
mySQL ===> data = 1, 5 id=???
sqLITE===> data = 1, 5 id=???
I need the ID to be the same, so when I synchronize it will not transfer over to the other database.
Another way I thought of is creating a hash between 2 columns in the database, and if the same data is on the other server then it does not synchronize that record of data.
Using a column of the database table as a unique identifier is not suitable in my case.
I'm really not sure how to go about this, so any help will be great, thanks!
I understand the question to mean that you need to somehow identify whether two rows in two different SQL databases are the same, either because they were independently created or because of an earlier sync.
I think your idea with a hash value is fine; it should do the trick. However, you could also just concatenate the column values in a string and get the same result, maybe with a dash in between in case you have several data columns that would otherwise become ambiguous ("12-2" and "1-12" are then different).
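On the MySQL side that could be as simple as the following (table and column names are placeholders); as far as I know SQLite has no built-in MD5, so on that side you would compute the same hash in your application code:

    -- Sketch: derive a sync id from the data columns, using '-' as the separator so
    -- values like (12, 2) and (1, 22) cannot produce the same string before hashing.
    SELECT MD5(CONCAT_WS('-', col_a, col_b)) AS sync_id, col_a, col_b
    FROM records;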
But you still need to send over the generated hash values or concatenated strings of all rows in order to sync. Maybe it makes sense to track rows that are already synced? But then you may need to untrack them if row data values are updated later.
I am not sure if this answer is helpful to you, because the question leaves many points open to speculation. May I suggest making it a bit clearer what you are trying to achieve?

Handle duplicates without removing them from database

I would like to know if there's some standard way to handle duplicates in the database without actually removing the duplicated rows, or a specific name for what I'm trying to achieve, so I can check it out.
Why would I keep duplicates? Because I have to monitor them. I have to know that they're duplicates and are not e.g. searchable, but at the same time I have to keep them, because I update the rows from an external source, and if I removed them they'd go right back into the database as soon as I update from that external source.
I have two ideas:
Have an additional boolean column "searchable", but I feel it's a partial solution; it could turn out to be insufficient in the future.
Have an additional column "duplicate_of". It would hold the id of the row of which this row is a duplicate. It would be a foreign key to the same table, which is kind of weird, isn't it?
I know it's not a specific programming question, but I think someone must have handled a similar situation (Facebook Pages, for example, keep track of which pages are duplicates of others), and it would be great to know a verified solution.
EDIT: these are close duplicates, identified mainly by their location (lat, lng), so DISTINCT is probably not a solution here.
I would create a view that has DISTINCT values. Having an additional column to be searchable sounds tedious. Your second idea is actually more feasible and there is nothing weird about a self-referencing table.
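A sketch of the self-referencing idea (made-up names, loosely based on your lat/lng edit):

    -- duplicate_of points at the row this one duplicates; NULL means "canonical".
    ALTER TABLE places
        ADD COLUMN duplicate_of INT NULL,
        ADD CONSTRAINT fk_places_duplicate_of
            FOREIGN KEY (duplicate_of) REFERENCES places (id);

    -- A view of the rows that should remain searchable
    CREATE VIEW searchable_places AS
    SELECT * FROM places WHERE duplicate_of IS NULL;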
The solution depends on several other factors. In particular, does the database support real deletes and updates (apart from setting the duplication information)?
You have a range of solutions. One is to place distinct values in a separate table, periodically. This works well if you have batch inserts, and no updates/deletes.
If you have a database that is being updated, then you might want to maintain a version number on the record. This lets you track it. Presumably, if it is a duplicate, there is another duplicate key inside it.
The problem with your second approach is that it can result in a tree-like structure of duplicates, where A-->B-->C and D-->C, so A and D are duplicates of the same row, but this is not obvious. If you always put in the earliest value and there are no updates or deletes, then this solution is reasonable.

How do I search part of a column?

I have a MySQL table containing 40 million records that is being populated by a process over which I have no control. Data is added only once every month. This table needs to be searchable by the Name column. But the name column contains the full name in the format 'Last First Middle'.
In the sphinx.conf, I have
    sql_query = SELECT Id, OwnersName, \
        substring_index(substring_index(OwnersName, ' ', 2), ' ', -1) AS firstname, \
        substring_index(OwnersName, ' ', 1) AS lastname \
        FROM table1
How do I use Sphinx to search by firstname and/or lastname? For example, I would like to be able to search for 'Smith' in the first name only.
Per-row functions in SQL queries are always a bad idea for tables that may grow large. If you want to search on part of a column, it should be extracted out to its own column and indexed.
I would suggest, if you have power over the schema (as opposed to the population process), adding new columns called OwnersFirstName and OwnersLastName, along with an update/insert trigger which extracts the relevant information from OwnersName and populates the new columns appropriately.
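A rough sketch of what that could look like, assuming the 'Last First Middle' format from the question (trigger and index names are made up; you would also want a matching BEFORE UPDATE trigger and a one-off UPDATE to backfill existing rows):

    -- New columns plus an index for fast lookups by first name
    ALTER TABLE table1
        ADD COLUMN OwnersLastName  VARCHAR(100),
        ADD COLUMN OwnersFirstName VARCHAR(100),
        ADD INDEX idx_owners_firstname (OwnersFirstName);

    -- Populate the new columns whenever a row is inserted
    CREATE TRIGGER trg_table1_split_name
    BEFORE INSERT ON table1
    FOR EACH ROW
        SET NEW.OwnersLastName  = SUBSTRING_INDEX(NEW.OwnersName, ' ', 1),
            NEW.OwnersFirstName = SUBSTRING_INDEX(SUBSTRING_INDEX(NEW.OwnersName, ' ', 2), ' ', -1);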
This means the expense of figuring out the first name is only done when a row is changed, not every single time you run your query. That is the right time to do it.
Then your queries become blindingly fast. And, yes, this breaks 3NF, but most people don't realize that it's okay to do that for performance reasons, provided you understand the consequences. And, since the new columns are controlled by the triggers, the data duplication that would be cause for concern is "clean".
Most problems people have with databases come down to the speed of their queries. Wasting a bit of disk space to gain a large performance improvement is usually okay.
If you have absolutely no power over even the schema, another possibility is to create your own database with the "correct" schema and populate it periodically from the real database. Then query yours. That may involve a fair bit of data transfer every month however so the first option is the better one, if allowed.
Judging by the other answers, I may have missed something... but to restrict a search in Sphinx to a specific field, make sure you're using the extended (or extended2) match mode, and then use the following query string: @firstname Smith.
You could use substring to get the parts of the field that you want to search in, but that will slow down the process. The query cannot use any kind of index to do the comparison, so it has to touch each record in the table.
The best approach is not to store several values in the same field, but to put the name components in three separate fields. When you store more than one value in a field, there are almost always problems accessing the data. I see this over and over in different forums...
This is an intractable problem, because full names can contain prefixes, suffixes, middle names or no middle names, composite first and last names with and without hyphens, etc. There is no reasonable way to do this with 100% reliability.

What is the best method/options for expiring records within a database?

In a lot of databases I seem to be working on these days, I can't just delete a record, for any number of reasons, including so it can be displayed later (say, a product that no longer exists) or just to keep a history of what was.
So my question is how best to expire the record.
I have often added a date_expired column, which is a DATETIME field. Generally I query either WHERE date_expired = 0, or WHERE date_expired = 0 OR date_expired > NOW(), depending on whether the data can be expired in the future. Similarly, I have also added a field called expired_flag; when this is set to true/1, the record is considered expired. This is probably the easiest method, although you need to remember to include the expiry clause any time you only want the current items.
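For illustration, the pattern looks roughly like this (the table name is just an example, and I've used NULL instead of a zero date for "never expires", but the idea is the same):

    -- Sketch of the expiry-column pattern; 'products' is an example table name
    ALTER TABLE products
        ADD COLUMN date_expired DATETIME NULL,
        ADD INDEX idx_date_expired (date_expired);

    -- Current rows: never expired, or expiring in the future
    SELECT *
    FROM products
    WHERE date_expired IS NULL OR date_expired > NOW();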
Another method I have seen is moving the record to an archive table, but this can get quite messy when there are a large number of tables that require history tables. It also makes the retrieval of the value (say country) more difficult as you have to first do a left join (for example) and then do a second query to find the actual value (or redo the query with a modified left join).
Another option, which I haven't seen done nor have I fully attempted myself, is to have a table that contains either all of the data from all of the expired records or some form of it--some kind of history table. In this case, retrieval would be even more difficult, as you would need to search a possibly massive table and then parse the data.
Are there other solutions or modifications of these that are better?
I am using MySQL (with PHP), so I don't know if other databases have better methods to deal with this issue.
I prefer the date expired field method. However, sometimes it is useful to have two dates, both initial date, and date expired. Because if data can expire, it is often useful to know when it was active, and that means also knowing when it started existing.
I like the expired_flag option over the date_expired option, if query speed is important to you.
I think adding the date_expired column is the easiest and least invasive method. As long as your INSERTS and SELECTS use explicit column lists (they should be if they're not) then there is no impact to your existing CRUD operations. Add an index on the date_expired column and developers can add it as a property to any classes or logic that depend on the data in the existing table. All in all the best value for the effort. I agree that the other methods (i.e. archive tables) are troublesome at best, by comparison.
I usually don't like database triggers, since they can lead to strange "behind the scenes" behavior, but putting a trigger on delete to insert the about-to-be-deleted data into a history table might be an option.
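Something along these lines (hypothetical table names; the history table needs to exist already):

    -- Sketch: archive a row into a history table just before it is deleted
    CREATE TRIGGER trg_products_archive
    BEFORE DELETE ON products
    FOR EACH ROW
        INSERT INTO products_history (product_id, name, deleted_at)
        VALUES (OLD.id, OLD.name, NOW());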
In my experience, we usually just use an "Active" bit, or a "DateExpired" datetime like you mentioned. That works pretty well, and is really easy to deal with and query.
There's a related post here that offers a few other options. Maybe the CDC option?
SQL Server history table - populate through SP or Trigger?
May I also suggest adding a "Status" column that matches an enumerated type in the code you're using. Put an index on the column and you'll be able to very easily and efficiently narrow down your returned data via your WHERE clauses.
Some possible enumerated values to use, depending on your needs:
Active
Deleted
Suspended
InUse (Sort of a pseudo-locking mechanism)
Set the column up as a TINYINT (that's SQL Server... not sure of the MySQL equivalent). You can also set up a matching lookup table with the key/value pairs and a foreign key constraint between the tables if you wish.
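A rough MySQL translation of that idea (all names illustrative; MySQL has a TINYINT type as well):

    -- Lookup table for the enumerated statuses
    CREATE TABLE record_status (
        status_id TINYINT UNSIGNED NOT NULL PRIMARY KEY,
        name      VARCHAR(20) NOT NULL
    );

    INSERT INTO record_status (status_id, name)
    VALUES (1, 'Active'), (2, 'Deleted'), (3, 'Suspended'), (4, 'InUse');

    -- Indexed status column plus the optional foreign key constraint
    ALTER TABLE records
        ADD COLUMN status_id TINYINT UNSIGNED NOT NULL DEFAULT 1,
        ADD INDEX idx_status (status_id),
        ADD CONSTRAINT fk_records_status
            FOREIGN KEY (status_id) REFERENCES record_status (status_id);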
I've always used the ValidFrom, ValidTo approach, where each table has these two additional fields. If ValidTo is NULL or > NOW(), then you know you have a valid record. In this way you can also add data to the table before it goes live.
There are some fields that my tables usually have: creation_date, last_modification, last_modifier (fk to user), is_active (boolean or number, depending on the database).
Look at the "Slowly Changing Dimension" SCD algorithms. There are several choices from the Data Warehousing world that apply here.
None is "best" -- each responds to different requirements.
Here's a tidy summary.
Type 1: The new record replaces the original record. No trace of the old record exists.
Type 4 is a variation on this that moves the history to another table.
Type 2: A new record is added into the customer dimension table. To distinguish records, a "valid date range" pair of columns is required. It helps to have a "this record is current" flag.
Type 3: The original record is modified to reflect the change.
In this case, there are columns for one or more previous values of the columns likely to change. This has an obvious limitation because it's bound to a specific number of columns. However, it is often used in conjunction with other types.
You can read more about this if you search for "Slowly Changing Dimension".
http://en.wikipedia.org/wiki/Slowly_Changing_Dimension
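As a concrete illustration of Type 2, a made-up customer dimension (not tied to your schema) might look like this:

    -- Sketch of a Type 2 layout: each change inserts a new row with its own
    -- validity window; valid_to IS NULL (plus the flag) marks the current row.
    CREATE TABLE customer_dim (
        surrogate_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        customer_id  INT NOT NULL,
        name         VARCHAR(100),
        valid_from   DATETIME NOT NULL,
        valid_to     DATETIME NULL,
        is_current   TINYINT(1) NOT NULL DEFAULT 1,
        KEY idx_customer (customer_id, is_current)
    );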
A very nice approach Oracle takes to this problem is partitions. I don't think MySQL has something similar, though.