SQL - Finding rows with unknown, but slightly similar, values? - mysql

I am trying to write a query that will return similar rows regarding the "Name" column.
My issue is that within my SQL database, there are the following examples:
NAME          DOB
Doe, John     1990-01-01
Doe, John A   1990-01-01
I would like a query that returns similar, but not exact, duplicates of the "Name" column. Since I do not know exactly which patients this occurs for, I cannot just query for "Doe, John%".
I have written this query using MySQL Workbench:
SELECT
Name, DOB, id, COUNT(*)
FROM
Table
GROUP BY
DOB
HAVING
COUNT(*) > 1 ;
However, this returns far too many rows, most of them with names that are not similar at all. Is there any way I can narrow down my results to include only rows with a similar (but not exactly duplicate!) Name? It seems impossible, since I do not know exactly which rows have similar names, but I figured I'd ask some experts.
To be clear, this is not a duplicate of the other question posted, since I do not know the content of the two (or more) strings, whereas that poster seemed to have known some content. Ideally, I would like to have the query limit results to rows with the first 3 or 4 characters being the same in the "Name" column.
But again, I do not know the content of the strings in question. Hope this helps clarify my issue.
What I intend on doing with these results is manually auditing the rest of the information in each of the duplicate rows (over 90 other columns per row may or may not have abstract information in them that must be accurate) and then deleting the unneeded row.
I would just like to get the most concise and accurate list I can to go through, so I don't have to scroll through over 10,000 rows looking for similar names.
For the record, I do know for a fact that the two rows will have identical names up until the middle initial. In the past, someone used a tool that exported names from one database to my SQL database, which included middle initials. Since then, I have imported another list that does not include middle initials. I am looking for the ones that have middle initials from that subset.

This is a very large topic, and the effort depends on what you consider "similar" and on how the data is structured. For example, are you going to want to match Doe, Johnathan as well?
Several algorithms exist but they can be extremely resource intensive when matching name alone if you have a large data set. That is why often using other attributes such as DOB, or Email, or Address to first narrow your possible matches then compare names typically works better.
When comparing you can use several algorithms such as Jaro-Winkler, Levenshtein Distance, ngrams. But you should also consider "confidence" of match by looking at the other information as suggested above.
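As a rough illustration of one of the algorithms named above, here is a minimal Levenshtein distance sketch in Python. It is written outside the database purely for illustration; the function name is hypothetical, and a real audit would likely narrow candidates by DOB first, as suggested above, before comparing names.

```python
def levenshtein(a, b):
    """Number of single-character inserts, deletes, or substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete from a
                current[j - 1] + 1,      # insert into a
                previous[j - 1] + cost,  # substitute (or keep)
            ))
        previous = current
    return previous[-1]
```

With the names from the question, `levenshtein("Doe, John", "Doe, John A")` is 2 (a space and a letter), so a small threshold on rows that share a DOB would surface that pair.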
One issue with matching addresses is that you have the same fuzzy-logic problems, such as "1st" vs "first". If going this route, I would convert the addresses into GPS coordinates using another service and then accept records within X distance of each other.
And the age-old issue with this is matching a husband and wife. I personally know a married couple both named Michael Hatfield. So you could try to bring in the gender of the name, but then Terry, Tracy, etc. can be either.
Bottom line: only go the route of name similarity if you have to, and if you do, look into other solutions such as the services from Melissa Data, or SQL Server Data Quality Services as a tool.
Update per the comment about middle initials: if you always know the name will be the same except for the middle initial, then this task can be fairly simple and does not need any complicated algorithm. You could match on one string + '%' being LIKE the other, then check that the lengths differ by exactly 2 and that the longer string has one more space than the shorter one. Or you could attempt to cleanse/remove the middle initial; this can be a little complicated if the name has a space in it, as in Doe, Ann Marie, but you could do it by testing whether the second-to-last character is a space.
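The middle-initial check described in the update can be sketched in Python as follows. This is only an illustration under assumptions: the row layout (`id`, `name`, `dob` keys) and both function names are hypothetical, and the same logic could be expressed as a self-join in SQL.

```python
from collections import defaultdict

def is_middle_initial_variant(short, long):
    """True when long is short plus a space and exactly one trailing letter."""
    return (
        len(long) == len(short) + 2
        and long.startswith(short + " ")
        and long[-1].isalpha()
    )

def find_initial_pairs(rows):
    """Return (id_without_initial, id_with_initial) pairs among rows sharing a DOB."""
    by_dob = defaultdict(list)
    for row in rows:
        by_dob[row["dob"]].append(row)  # narrow by DOB before comparing names
    pairs = []
    for group in by_dob.values():
        for a in group:
            for b in group:
                if is_middle_initial_variant(a["name"], b["name"]):
                    pairs.append((a["id"], b["id"]))
    return pairs
```

A name such as Doe, Ann Marie is not flagged against Doe, Ann, because the lengths differ by more than 2.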

Related

Grouping similar field data in MySQL

In MySQL, I have a table that accepts common data from multiple input channels and consists of ~100,000 rows.
One of the fields stores the name of an employee's functional manager. In the organisation, there are ~100 of these functional managers.
The issue I have is that, as there are multiple input channels, different reporting systems have used different name formats for these managers.
For example, John Smith could be stored as:
John Smith
Smith, John
Smith John
This is a bit of a nightmare now, as we are looking to use this functional manager field as a mechanism for reporting, which means we would need to sort or group by individual functional managers.
The data becomes legacy after each quarter, so we are happy to clean and format the functional manager field.
The question is, is there a simple way to group these managers, even though their names are in different formats? I am looking for a way that does not involve me going one by one through each functional manager with a statement like this:
UPDATE tablename SET fm_name = "John Smith" WHERE fm_name LIKE "%John%" AND fm_name LIKE "%Smith%";
For example; programmatically, I could take the first record, break the name into its first and last name strings, then match similar records and update them. Then move to the next record. Is something like that possible in MySQL or would I be better to do that in the layer above.
Any suggestions would be greatly appreciated.
If you can come up with a normalizing function name_normalize(string) that yields George H. W. Bush given either that exact input or Bush, George H. W., then you can do
GROUP BY name_normalize(name)
and get what you want without mucking around with the data in your table.
This is such a function. It hacks around with MySQL's string functions. https://dev.mysql.com/doc/refman/5.7/en/string-functions.html
IF(LOCATE(',',@name1) = 0, -- need to change?
@name1, -- no, return original
LEFT(CONCAT_WS(' ', -- yes, concatenate...
TRIM(SUBSTRING_INDEX(@name1, ',',-1)), -- after last ,
@name1), -- whole name
LENGTH( -- cut to original name length
REPLACE(@name1,',','')))) -- but without the comma
Substitute the name of your column for @name1. And beware, this is sensitive to the number of spaces after the comma.
You'd be wise to define this function as a stored function. For one thing, you can handle the odd cases better. For another, it's kind of long to write in a query.
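If the cleanup ends up in the layer above the database, the same normalize-then-group idea might look like this in Python. This is a sketch, not the answer's stored function: `group_by_normalized` is a hypothetical helper, and unlike the SQL expression this version also collapses repeated spaces.

```python
from collections import defaultdict

def name_normalize(name):
    """Turn 'Last, First ...' into 'First ... Last'; other names only get whitespace cleanup."""
    if "," not in name:
        return " ".join(name.split())
    before, _, after = name.rpartition(",")  # split at the last comma
    return " ".join((after + " " + before).split())

def group_by_normalized(names):
    """Group raw name strings under their normalized form."""
    groups = defaultdict(list)
    for raw in names:
        groups[name_normalize(raw)].append(raw)
    return dict(groups)
```

Note that a reversed name without a comma, such as Smith John, still lands in its own group; handling that case needs extra knowledge, for example a list of known manager names.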

How to perform inexact matches on two data sets

I'm trying to compare two data sets (vendor masters) from two systems. We are moving to one system, so we want to avoid duplication. The issue is that the names, addresses, etc. could be slightly different. For example, the name might end in 'Inc' or 'Inc.', or the address could be 'St' or 'Street'. The vendor masters have been dumped to Excel, so I was thinking about pulling them into Access to compare them, but I'm not sure how to handle the inexact matches. The data fields I need to compare are: name, address, telephone number, federal tax ID (if populated), contact name.
Here is how I would proceed. You will rarely get answers like this on Stack Exchange, since your question is not focused enough. This is a rather generic set of steps not specific to a particular tool (i.e. database or spreadsheet). As I said in my comments, you'll need to search for specific answers (or ask new questions) about the particular tools you use as you go. Without knowing all the details, Access can certainly be useful in doing some preliminary matching, but you could also use Excel directly, or even Oracle SQL since you have it as a resource.
1. Back up your data.
2. Make a copy of your data for matching purposes.
3. Ensure that each record in both sets of data has a unique key (i.e. an AutoNumber field or similar), so that until you have a confirmed match the records can always be separately identified.
4. Create a new matched-key table and/or fields containing the list of matched unique key values.
5. Create new "matching" fields and copy your key fields into these new fields.
6. Scrub the data in all possible matching fields by:
   - Removing periods and other punctuation.
   - Choosing standard abbreviations and replacing all variations with the same value in all records. Example: replace "Incorporation" and "Inc." with "Inc".
   - Trimming excess spaces from the ends and between terms.
   - Formatting all phone numbers exactly the same way, or better yet removing all spaces and punctuation for comparison purposes, excluding extension information: ##########
   - Parsing and splitting multi-term fields into separate fields. Name -> First, Middle, Last Name fields; Address -> Street number, street name, extra address info. The parsing process itself can identify and reconcile formatting differences, and it allows easier matching on terms separately.
   - Etc., etc.
7. Once the matching fields are sufficiently scrubbed, match on the different fields:
   - Define matching priorities, that is, which field or fields are likely to produce reliable matches with the least amount of uncertainty.
   - For records containing Tax ID numbers, that seems like the most logical place to start, since an exact match on that number should be valid OR can indicate mistakes in your data.
   - For each type of match, update the matched-key fields mentioned above.
   - For each successive matching query, exclude records that already have a match in the matched-key table/fields.
8. Refine and repeat all these steps until you are satisfied that all matches have been found.
9. Add all non-matched records to your final merged record set.
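The scrub-then-match-by-priority steps above can be sketched in Python. Everything specific here is an assumption made for illustration: the record layout (`id`, `name`, `tax_id`), the abbreviation table, and the function names are hypothetical, and only two match passes (tax ID, then scrubbed name) are shown.

```python
import re

# Hypothetical abbreviation table; a real one would be built from the data.
ABBREVIATIONS = {"incorporated": "inc", "incorporation": "inc", "street": "st"}

def scrub_text(value):
    """Lowercase, drop punctuation, collapse spaces, standardize abbreviations."""
    words = re.sub(r"[^\w\s]", "", value.lower()).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)

def scrub_phone(value):
    """Keep digits only, after dropping a trailing extension such as 'ext. 89' or 'x12'."""
    main = re.split(r"\s*(?:ext\.?|x)\s*\d+\s*$", value, flags=re.IGNORECASE)[0]
    return re.sub(r"\D", "", main)

def match_records(left, right):
    """Match left ids to right ids, most reliable key first; matched records are excluded from later passes."""
    key_functions = [
        lambda r: re.sub(r"\D", "", r.get("tax_id") or "") or None,  # pass 1: tax ID
        lambda r: scrub_text(r["name"]) or None,                     # pass 2: scrubbed name
    ]
    matches = {}
    for key_of in key_functions:
        index = {}
        for record in right:
            key = key_of(record)
            if key is not None and record["id"] not in matches.values():
                index.setdefault(key, record["id"])
        for record in left:
            key = key_of(record)
            if record["id"] not in matches and key in index:
                matches[record["id"]] = index.pop(key)
    return matches
```

The returned dictionary plays the role of the matched-key table; anything absent from it would go to manual review or straight into the merged set.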
You never said how many records you have. If possible, it may be worth your organization's time to manually verify the automated matches by listing them side by side and manually tweaking them when needed.
But even if you successfully pair non-exact matches, someone still needs to decide which record to keep for the merged system. I imagine you might have matches on company name and tax ID (essentially verifying the match) but still have different addresses and/or contact names. There is no technical answer that will help you know which data to keep or discard. Once again, human review should be done to finalize the merged records. If you set this up correctly, a couple of human eyeballs could probably go through thousands of records in just a day.

Intelligent Comparison based Update - Access / VBA

Need to intelligently perform updates on an access table.
Expert VBA / Intelligent Thinking would be required.
Table1 (For reference only)
CompanyCode Text
RegionCategory Number (1-99)
RegionCount Number(0 - 25000)
Table2
InvoiceNumber Number
CompanyCode Text
NumRows Number
RegionCode FourdigitNumber
ConfirmationRemark Y / N
Our objective is to put a Yes or No in the 'ConfirmationRemark' column.
Rules:
1. Select only those InvoiceNumbers that have exactly two rows in Table2 with different RegionCodes. These will have the same CompanyCode. RegionCategory is the first two digits of RegionCode.
2. For these two invoices, the difference between the two RegionCategory values must be greater than two.
3. Look up the RegionCount from Table1.
Decision making:
We are now basically comparing two invoices with different RegionCodes.
The idea is that the invoice with the higher RegionCount is the one to be marked Yes.
1. The difference between the RegionCounts must be considerable. I am still trying to determine the right number for 'considerable'; let us take 500 for now.
2. The invoice with the lower RegionCount should have a RegionCount of zero (best case) or a very low one. If the invoice with the lower RegionCount has a high value (> 200), then we cannot successfully conclude.
3. NumRows is preferred to be 1, or less than that of the other invoice. This comparison is not mandatory, so there should be a provision to skip it. Mark the other invoice as 'N'.
You have many ways to approach that type of complex update.
If you are lucky, you may be able to craft a SQL UPDATE statement that can include all the changes, but often you will have to resort to a combination of SELECT queries and custom VBA to filter them based on the results of calculations or lookups involving other data.
A few hints
Often, we tend to think about a problem in terms of 'what are the steps to get to the data that match the criteria'.
Sometimes, though, it's easier to turn the problem on its head and instead ask yourself 'what are the steps to get to the data that do not match the criteria'.
Because in your case the result is boolean, true or false, you could simply set the ConfirmationRemark field to True for all records and then update those that should be set to False, instead of the other way around.
Break down each step (as you did) and try to find the simplest SELECT query that will return just the data you need for that step. If a step is too complex, break it down further.
Compose your broken down SELECT statements together to slowly build a more complex query that tends toward your goal.
Once you have gone as far as you can, either construct an UPDATE Table2 SET ConfirmationRemark=True WHERE InvoiceNumber IN (SELECT InvoiceNumber ....) or use VBA to go through the recordset of results from your complex SELECT statement and do some more checks before you update the field in code.
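To illustrate breaking the decision into small testable steps, here is a Python sketch of the per-pair check that would run over each candidate pair of rows. It is one reading of the rules in the question, not a confirmed specification: the dictionary keys, the function name, and the choice of "at least 500" for a considerable gap are all assumptions, and the RegionCount values are assumed to have been looked up already.

```python
def decide(row_a, row_b, min_gap=500, max_low=200):
    """Return the remarks for (row_a, row_b), or None when no conclusion is possible."""
    # RegionCategory is the first two digits of the four-digit RegionCode.
    category_a = row_a["region_code"] // 100
    category_b = row_b["region_code"] // 100
    if abs(category_a - category_b) <= 2:
        return None  # categories too close to compare
    high = max(row_a["region_count"], row_b["region_count"])
    low = min(row_a["region_count"], row_b["region_count"])
    if high - low < min_gap:
        return None  # difference not considerable
    if low > max_low:
        return None  # the lower count is itself too high to conclude
    return ("Y", "N") if row_a["region_count"] > row_b["region_count"] else ("N", "Y")
```

The optional NumRows comparison is left out here; it could be added as one more guarded check behind a flag.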
Some issues
Unfortunately, despite your efforts to document your situation, there are not enough details for us to really help:
you do not mention which are the primary keys (from what you say, it seems that Table2 could have multiple records with identical InvoiceNumber)
the type of data you are dealing with is not obvious. You should include a sample of data and identify which ones should end-up with ConfirmationRemark set.
your problem is really too localised, meaning it is too specific to you to be of value to anyone else, although I think that with a bit more detail your question could be of interest, if only to show an example of how to approach complex data updates in Access.

Select the records containing one or more words fully in UPPERCASE

I have a query in a MySQL database. I have a table order_det, and the table's column remarks_desc contains entries as follows:
Table structure:
Table: order_det
Columns: rec_id, remarks_desc
Sample records in order_det table
rec_id remarks_desc
_________________________________________________________
1 a specific PROGRAMMING problem
2 A software Algorithm
3 software tools commonly USED by programmers
4 Practical, answerable problems that are unique to the programming profession
5 then you’re in the right place to ask your question
6 to see if your QUESTION has been asked BEFORE
My requirement: I want to select only the records that contain one or more words stored in all uppercase letters. From the above 6 records, I want to select only records 1, 3 and 6:
rec_id remarks_desc
__________________________________________________
1 a specific PROGRAMMING problem (it contains one all uppercase word PROGRAMMING)
3 software tools commonly USED by programmers (it contains one all uppercase word USED)
6 to see if your QUESTION has been asked BEFORE (it contains two all uppercase words QUESTION and BEFORE)
I tried to achieve this using LIKE and REGEXP, but I am getting incorrect results.
Please help me to get the correct result.
Try:
SELECT rec_id, remarks_desc FROM order_det WHERE remarks_desc REGEXP '(^|[[:blank:]])[[:upper:]][[:upper:]]+([[:blank:]]|$)'
I have assumed that you want to exclude single-letter capitalised words. If you want to exclude capitalised words at the start of the string, you'll need to tweak the regex.
Make sure that your table collation is case sensitive (_cs not _ci)
I used information from http://dev.mysql.com/doc/refman/5.1/en/regexp.html#operator_regexp
However, if you're having to use regular expressions to extract data from a database, it's worth considering whether your database design could be improved.
This is particularly important if you need good performance from the database.
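For checking the pattern outside the database, a rough Python equivalent of the regular expression above is sketched below. Python's `re` is case sensitive by default, so no collation setting is involved; `has_uppercase_word` is a hypothetical name, and `[ \t]` stands in for `[[:blank:]]`.

```python
import re

# Two or more uppercase letters, bounded by start/end of string or a blank.
UPPER_WORD = re.compile(r"(^|[ \t])[A-Z]{2,}([ \t]|$)")

def has_uppercase_word(text):
    """True when text contains at least one all-uppercase word of 2+ letters."""
    return UPPER_WORD.search(text) is not None

rows = {
    1: "a specific PROGRAMMING problem",
    2: "A software Algorithm",
    3: "software tools commonly USED by programmers",
    6: "to see if your QUESTION has been asked BEFORE",
}
selected = [rec_id for rec_id, text in rows.items() if has_uppercase_word(text)]
```

Like the SQL version, this skips single-letter words such as the leading "A" in record 2, and it does not match an uppercase word that is directly followed by punctuation.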
Here is a pretty straightforward stored function which returns the number of uppercase words in a row.
Cons:
it's a stored function, not pure SQL;
it uses COLLATE;
it uses REGEXP, but feel free to get rid of it by using another inner loop instead;
it counts all words, but you can add a break once you reach 2.
Please find the function on the following link (gist.github.com). It doesn't display correctly here.

Any way to compare/match sentences with only a different word order?

I have 2 MySQL tables, each with address data of companies in it. One table is more recent, but has no telephone and no website data. Now I want to unite these tables into 1 recent and complete table.
But for some companies the order of the words is different, like this:
'Bakery Johnson' in table 1 and 'Johnson Bakery' in table 2.
Now I need to find a way to compare these values, as they're obviously the same company.
I think I will somehow have to split those names first, and then order the different parts alphabetically.
Any chance anybody has done something like this before, and willing to share some code or function?
UPDATE:
I found a function that sorts words inside a string. I can use this to detect name swaps as described above. It's quite SLOW though...
See : MySQL: how to sort the words in a string using a stored function?
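If the comparison can be done outside MySQL, the split-and-sort idea is short to sketch in Python. This assumes the company names have been loaded into plain lists; both function names are hypothetical.

```python
def word_order_key(name):
    """Lowercased words sorted alphabetically, so word order no longer matters."""
    return " ".join(sorted(name.lower().split()))

def pair_by_words(table1_names, table2_names):
    """Pair names from table 1 with names from table 2 that use the same words."""
    index = {word_order_key(name): name for name in table2_names}
    return [
        (name, index[word_order_key(name)])
        for name in table1_names
        if word_order_key(name) in index
    ]
```

Building the index once keeps the work roughly linear in the number of names, which avoids calling a slow per-row sorting function inside a join.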
If your table is MyISAM you can run this query:
SELECT *
FROM mytable
WHERE MATCH(name) AGAINST ('+bakery +johnson' IN BOOLEAN MODE)
This will find all records containing the words bakery and johnson (and probably some other words too).
Creating a FULLTEXT index on the table:
CREATE FULLTEXT INDEX
fx_mytable_name
ON mytable (name)
will speed up this query.
Stepping back a bit from your solution, you could take an approach similar to the way modern phones resolve duplicate-name conflicts.
You present your user with the option when they find something suspicious:
Is this a duplicate? Use our [ Merge ] option
You are merging Bakery Johnson, please select the source/original item:
[ Johnson Bakery v ] (my amazing dropdown!)
Everything not already in Johnson Bakery gets ported to Bakery Johnson (orders, for example). You may also show an intermediate screen displaying what will be merged, or let the user pick; for example, they may want the address info from Johnson Bakery and the orders from both.
It is not self correcting as you asked, but the collaboration from the users may be more accurate than AI here. I also love low-tech solutions like this so let us know what you ended up doing.